carmen-cache is a low-level storage layer used in the carmen geocoder.
By default, binaries are provided for
64 bit OS X >= 10.8 and
64 bit Linux (>= Ubuntu Trusty). On those platforms no external dependencies are needed.
Other platforms will fall back to a source compile: see [Source Build](#Source Build) for details
To build from source run:
make yarn test
This will automatically:
- Install mason locally
- Install several C/C++ dependencies locally (in
mason_packages) via mason: bzip, rocksdb, protozero, and the clang++ compiler
To do a full rebuild run:
See CONTRIBUTING for how to release a new carmen-cache version.
When doing a forward (text -> coordinates) geocode, carmen goes through a few steps. The carmen README has a full rundown, but abbreviated for purposes of this module,
carmen does approximately as follows:
- break the query up into constituent substrings
- check to see which substrings are in which indexes
- look up the specific the tile coordinates of each occurrence of each substring in each matching index
- determine which occurrences of which substrings approximately spatially align
- for any plausible stackings of features, retrieve full feature geometry and verify the estimated spatial alignment
- format any results for response to the user
The steps in bold are implemented in this module in C++ for disk/memory compactness and speed; the rest are implemented in
As a concrete example, if a user were searching for "paris france," carmen might determine that the
country index contained the string "france" and the
city index contained multiple occurrences of "paris" (one for the real one in France, one for the one in Texas, etc.). carmen-cache would be responsible for retrieving the tile coordinates of all the tiles covering the country France, and the tile coordinates covering each Paris, and seeing which aligned with which; it would discover that the French Paris's tiles overlapped with some of the tiles in the France feature, making that combination a plausible stacking, while the Texas Paris's tiles would not align with France. It would return these results to
carmen for further verification.
carmen-cache exposes two implementations of the same interface, one read-write version called
MemoryCache and one disk-based read-only version called
RocksDBCache build on Facebook's RocksDB. The read-write version is used during
carmen's index-building process, at the end of which it's serialized into the read-only version for storage. At query time, the read-only version is used instead, as it's both faster and more memory-efficient.
Carmen-cache knows about the following kinds of data:
- keys: these are strings that might occur in a feature or a user query
- grids: each key is mapped to one or more grid values, each of which is a 64-bit representation of a tuple of score, textual relevance, grid coordinates (x and y), and feature ID representing the metadata about one occurrence of the given key within the index. The same key might map to more than one grid if, for example, more than one feature has the same name, or a single feature spans more than one tile. Grids are exposed to and expected from carmen as integers; see
test/grid.jsfor an example of how the these 5-tuples should be packed into an integer representation.
- languages: keys are optionally annotated with the language or languages for which the key is valid for the given feature. For example, a feature called "London" in English but "Londres" in French can be stored under both keys but with different language annotations. This way,
carmen-cachecan apply a penalty for results that match a given query in a language other than the one the user requested.
carmen-cacherepresents each language as a number, and has no internal concept of which real-world language each number maps to; it's
carmen's responsibility to perform that mapping.
MemoryCache supports setting of a key (and optional list of language numbers) to a list of grid numbers. It also supports a
pack operation, which writes out the read-write form into a henceforth-read-only version encoded on disk as a RocksDB database.
Both versions support a
list operation to retrieve all keys, a
get operation to retrieve a grid list for a given key and language set, and a
getMatching operation that can do either or both of:
- retrieve grids for all occurrences of a key with optional penalties applied for non-matching languages
- retrieve grids for all keys starting with a given prefix (useful for autocomplete queries)
The RocksDB representation of the cache condenses the data for on-disk storage as a RocksDB database. It is a key-value store:
Each key in the RocksDB database is a string, followed by a
|delimiter, followed by as many as 128 bits of data representing the language annotation. If this bit set is of zero length, it is interpreted as
2^128 - 1(i.e., all
1, which matches all languages). If it's any other length less than 128 bits, it's interpreted as the least significant (conceptually right-most) bits of a 128-bit integer (in other words, it's padded with
0on the most significant/conceptually-left-most side).
Each value is a compact representation of the set of grid integers for a given set of grids. This representation is obtained by sorting the integer representations of the grids in descending order, delta-encoding them (that is, storing all values after the first as the difference between it and its predecessor), and packing them as variable-length integers into a
protobufbuffer. Reading from this structure operates in reverse, expanding out all values after the first subtractively, and can be done lazily.
RocksDB representation contains an additional optimization to assist in autocomplete queries: it precomputes combined sorted lists of grids automatically for fixed-length prefixes of length 3 and length 6, so as to reduce the number of seeks and reads necessary to calculate autocomplete results for very short autocomplete queries. These precomputed versions are stored with a key that begins with
=2 (for shorter and longer prefixes, respectively), followed by the prefix string, followed by the
| delimiter and language bitmask as per usual. Prefixes include language annotations and are thus per-language-set just like other keys. This process is transparent to
carmen: these keys are calculated and populated automatically at
pack time, read automatically instead of reading the full
grid lists at
getMatching time if the requested key is sufficiently short, and hidden from, e.g.,
coalesce operation is what computes the possible stacking of combinations of substrings and returns the results to carmen. It can take advantage of the C++ threadpool to consider multiple possible stackings in parallel, and contains two implementations:
coalesceMulti. The former handles cases where a given query could be satisfied in its entirety by a single index, whereas the latter considers multi-index interactions.
coalesce expects a set of
phrasematch objects (see
carmen's source for what they contain), and returns a set of coalesce results via callback to
A brief diagrammatic overview of how
coalesceMulti works follows: