Compass::Core provides an abstraction layer on top of the wonderful Lucene Search Engine. Compass::Core also provides several additional features on top of Lucene, like two phase transaction management, fast updates, and optimizers. To explain how Compass::Core works with the Search Engine, we first need to understand the Search Engine domain model.
A Resource represents a collection of properties. You can think of it as a virtual document: a chunk of data, such as a web page, an email message, or a serialization of the Author object. A Resource is always associated with a single Alias, and several Resources can have the same Alias. A Property is just a placeholder for a name and a value (both strings). A Property within a Resource represents some kind of meta-data that is associated with the Resource, like the author name. In database terms, you can think of an Alias as a table, a Resource as a row in the table, and a Property as a column (with a value). Note: a Resource can have several properties with the same name.
Every Resource is associated with one or more id properties. They are required for Compass::Core to manage Resource loading based on ids and Resource
updates (a well known difficulty when using Lucene directly). Id properties are defined either explicitly in the Resource Mapping definition or implicitly in the OSEM definition.
For Lucene users, Compass Resource maps to Lucene Document and Compass Property maps to Lucene Field.
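To make the model concrete, here is a minimal, self-contained sketch in plain Java. The classes `ToyProperty` and `ToyResource` are hypothetical stand-ins written for this illustration; they are not the Compass::Core or Lucene types. They only mirror the relationships described above: one alias per resource, string name/value properties, and several properties sharing the same name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Property: a name/value pair of strings.
final class ToyProperty {
    final String name;
    final String value;

    ToyProperty(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a Resource: one alias plus an ordered list of properties.
final class ToyResource {
    final String alias;
    private final List<ToyProperty> properties = new ArrayList<>();

    ToyResource(String alias) {
        this.alias = alias;
    }

    // Adds a property; the same name may be added more than once.
    ToyResource add(String name, String value) {
        properties.add(new ToyProperty(name, value));
        return this;
    }

    // Returns every value stored under the given name, in insertion order.
    List<String> values(String name) {
        List<String> result = new ArrayList<>();
        for (ToyProperty p : properties) {
            if (p.name.equals(name)) {
                result.add(p.value);
            }
        }
        return result;
    }
}

class DomainModelDemo {
    public static void main(String[] args) {
        ToyResource book = new ToyResource("book")
                .add("id", "BOOK001")
                .add("author", "First Author")
                .add("author", "Second Author"); // same name, second value
        System.out.println(book.alias + " authors: " + book.values("author"));
    }
}
```

In the table analogy, `alias` plays the role of the table name, each `ToyResource` is a row, and each `ToyProperty` is a column value, with the difference that a column may repeat.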
In order to create a Resource and a Property, you use the CompassSession, which acts as a factory. CompassSession provides the following factory methods:
createResource(String alias): Creates a Resource with the specified alias.
createProperty(String name, String value, Property.Store store, Property.Index index): Creates a Property with the specified name and value, and sets how the Property is stored and indexed within the Search Engine.
createProperty(String name, String value, Property.Store store, Property.Index index, Property.TermVector termVector): Creates a Property with the specified name and value, and sets how the Property is stored and indexed within the Search Engine, including its term vector behaviour.
When creating a Property, you must specify the store and the index parameters.
Another option when creating a Resource Property is to define a resource mapping, and within it resource-id and resource-property mappings (please see the Resource Mapping Section). With the mappings defined, Compass can work out the type, index, store, and other options from them. This allows the use of the simple addProperty(String propertyName, Object value) API of Resource, which also converts the Object to the correct value using the Compass converter architecture.
The following table specifies the available values for the store parameter:
Table 3.1.
| Store | Description |
|---|---|
| Property.Store.NO | Do not store the property value in the index (won't be able to retrieve it later on). |
| Property.Store.YES | Stores the original property value in the index. |
| Property.Store.COMPRESS | Stores the original property value in the index in a compressed form. |
The following table specifies the available values for the index parameter:
Table 3.2.
| Index | Description |
|---|---|
| Property.Index.NO | Do not index the property value. This property can thus not be searched, but one can still access its contents provided it is stored. |
| Property.Index.TOKENIZED | Index the property value so it can be searched. An Analyzer will be used to tokenize and possibly further normalize the text before its terms are stored in the index. |
| Property.Index.UN_TOKENIZED | Index the property value without using an Analyzer, so it can be searched. As no analyzer is used, the value will be stored as a single term (perfect for id-like properties). |
The following table specifies the available values for the termVector parameter:
Table 3.3.
| Term Vector | Description |
|---|---|
| Property.TermVector.NO | Do not store any term vector information (the default behavior). |
| Property.TermVector.YES | Store the term vectors of each document. A term vector is a list of the document's terms and their number of occurrences in that document. |
| Property.TermVector.WITH_POSITIONS | Store the term vector + Token position information. |
| Property.TermVector.WITH_OFFSETS | Store the term vector + Token offset information. |
| Property.TermVector.WITH_POSITIONS_OFFSETS | Store the term vector + Token position and offset information. |
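The difference between the term vector options is easier to see on a concrete input. The following sketch is a plain Java illustration, not Lucene or Compass::Core code: `ToyTermVector` is a hypothetical class, and it assumes tokens are split on whitespace and lowercased, which real analyzers may not do. It records the three kinds of data the options refer to: the number of occurrences of each term, the token positions, and the character start offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: records, per term, the data the term vector options describe.
final class ToyTermVector {
    // term -> token positions (0-based index of the token in the text)
    final Map<String, List<Integer>> positions = new TreeMap<>();
    // term -> character start offsets in the original text
    final Map<String, List<Integer>> startOffsets = new TreeMap<>();

    static ToyTermVector of(String text) {
        ToyTermVector tv = new ToyTermVector();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tv.positions.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
                tv.startOffsets.computeIfAbsent(term, k -> new ArrayList<>()).add(start);
                position++;
            }
        }
        return tv;
    }

    // Number of occurrences of the term: the information a plain term vector holds.
    int frequency(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? 0 : p.size();
    }
}

class TermVectorDemo {
    public static void main(String[] args) {
        ToyTermVector tv = ToyTermVector.of("the call of the wild");
        System.out.println("frequency(the) = " + tv.frequency("the"));
        System.out.println("positions(the) = " + tv.positions.get("the"));
        System.out.println("offsets(the)   = " + tv.startOffsets.get("the"));
    }
}
```

For the input above, "the" occurs twice, at token positions 0 and 3 and at character offsets 0 and 12. Storing positions and offsets costs extra index space, which is why they are opt-in.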
The following code shows how you can create a Resource with Compass::Core and save it.
CompassSession session = compass.openSession();
CompassTransaction tx = session.beginTransaction();
Resource authorResource = session.createResource("author");
Property authorIdProp = session.createProperty("id", "AUTHOR0812",
Property.Store.YES, Property.Index.UN_TOKENIZED);
Property authorNameProp = session.createProperty("name",
"Jack London", Property.Store.YES, Property.Index.TOKENIZED);
authorResource.addProperty(authorIdProp);
authorResource.addProperty(authorNameProp);
session.save(authorResource);
tx.commit();
Compass::Core allows you to set the boosting factor for a Resource and a Property (through the Lucene boosting feature). Boosting is the process of making a Resource or a Property more or less "important" than others.
Initially, Resource and Property have no boost (actually, a boost of 1.0). You can set the boost level on a Resource (which propagates to all the properties that have no boosting set) or on a specific Property. Values higher than 1.0 make it more relevant, and values lower than 1.0 make it less relevant.
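The propagation rule can be summarised in a few lines. The sketch below is a hypothetical model in plain Java (the class `BoostedResource` and its methods are invented for this example and are not the Compass::Core API): a property that has its own boost keeps it, and any other property inherits the boost of its resource, which defaults to 1.0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of boost propagation; not the Compass::Core API.
final class BoostedResource {
    private float resourceBoost = 1.0f; // default: effectively no boost
    private final Map<String, Float> propertyBoosts = new LinkedHashMap<>();

    void setBoost(float boost) {
        this.resourceBoost = boost;
    }

    void setPropertyBoost(String property, float boost) {
        propertyBoosts.put(property, boost);
    }

    // A property's own boost wins; otherwise it inherits the resource boost.
    float effectiveBoost(String property) {
        Float own = propertyBoosts.get(property);
        return own != null ? own : resourceBoost;
    }
}

class BoostDemo {
    public static void main(String[] args) {
        BoostedResource author = new BoostedResource();
        author.setBoost(2.0f);                 // the whole resource is more relevant
        author.setPropertyBoost("bio", 0.5f);  // except its "bio" property
        System.out.println("name: " + author.effectiveBoost("name"));
        System.out.println("bio:  " + author.effectiveBoost("bio"));
    }
}
```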
Analyzers are components that pre-process input text. They are also used when searching (the search string has to be processed the same way that the indexed text was processed). Therefore, it is important to use the same Analyzer for both indexing and searching.
Compass::Core can be configured to have multiple analyzers, registered under different analyzer names. It has two internal analyzers: default and search, as defined in the Analyzers section in the Configuration chapter.
Analyzer is a Lucene class (its fully qualified name is org.apache.lucene.analysis.Analyzer). Lucene core itself comes with several Analyzers, and you can configure Compass::Core to work with any of them. If we take the following sentence: "The quick brown fox jumped over the lazy dogs", we can see how the different Analyzers handle it:
whitespace (org.apache.lucene.analysis.WhitespaceAnalyzer)
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
simple (org.apache.lucene.analysis.SimpleAnalyzer)
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
stop (org.apache.lucene.analysis.StopAnalyzer)
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
standard (org.apache.lucene.analysis.standard.StandardAnalyzer)
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Some Analyzers have a list of stop words which they exclude while analyzing. You can control both the Analyzer and the stop words using Compass::Core configuration parameters.
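The token lists above can be reproduced with a rough approximation in plain Java. This is a sketch for illustration only: `ToyAnalyzers` is a hypothetical class that does not use the Lucene Analyzer classes, and its stop list is a small sample chosen for the example rather than Lucene's actual list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough approximations of three analyzers; not the Lucene implementations.
final class ToyAnalyzers {
    // A small sample stop list chosen for this example (an assumption).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Splits on whitespace and keeps the original case.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Splits on anything that is not a letter and lowercases each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Same as simple, then drops every stop word.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }
}

class AnalyzerDemo {
    public static void main(String[] args) {
        String sentence = "The quick brown fox jumped over the lazy dogs";
        System.out.println("whitespace: " + ToyAnalyzers.whitespace(sentence));
        System.out.println("simple:     " + ToyAnalyzers.simple(sentence));
        System.out.println("stop:       " + ToyAnalyzers.stop(sentence));
    }
}
```

The sketch also shows why indexing and searching should use the same analysis: text indexed with `simple` contains the term "the", so a query analyzed with `whitespace` that yields "The" would not match it.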
It is very important to understand how the Search Engine index is organized so we can then talk about transactions and optimizers. The following structure shows the Search Engine Index Structure:
---[index dir]/index
   |
   |-- [subIndex1]
   |   |
   |   |--- segments
   |   |--- [segment1]
   |   |--- [segment2]
   |
   |-- [subIndex2]
   |   |
   |   |--- segments
   |   |--- [segment1]
   |   |--- [segment2]
   |   |--- [segment3]
   |
   ...
Every sub-index has its own fully functional index structure (which maps to a single Lucene index). Each Resource alias is associated with a sub-index, and more than one alias can be mapped to a sub-index (using either resource mapping or OSEM). The Lucene index part holds a "meta data" file about the index (called segments) and 0 to N segment files. A segment can be a single file (if the compound setting is enabled) or multiple files (if the compound setting is disabled). A segment is close to a fully functional index and holds the actual inverted index data (see the Lucene documentation for a detailed description of these concepts).
The Compass::Core Search Engine abstraction provides support for transaction management on top of Lucene. The abstraction supports the common transaction levels, read_committed and serializable, as well as the special batch_insert one. Compass::Core provides two phase commit support for the common transaction levels only.
Compass::Core uses Lucene's intra- and inter-process locking mechanism to establish its transaction locking. Note that transaction locking is done at the sub-index level, which means that dirty operations lock only their respective sub-index. So the more aliases map to the same sub-index, the more aliases are locked when performing dirty operations, yet the faster searches will be. Lucene uses a special lock file to manage this locking, and its location can be set in the Compass::Core configuration. You can also manage the transaction timeout and polling interval using the Compass::Core configuration.
The Compass::Core transaction acquires a lock only when a dirty (i.e. create, save or delete) operation occurs, which makes "read only" transactions as fast as they should and can be.
Compass::Core supports the read_committed transaction level. When a read_committed transaction starts, no locks are obtained, and read operations do not obtain a lock either. A lock is obtained only when a dirty operation is performed. The lock is obtained only on the sub-index of the alias associated with the dirty operation, and it therefore also locks all other aliases that map to that sub-index. In Compass::Core, every transaction that performed one or more save or create operations and committed successfully creates another segment in the respective index (unlike the way Lucene manages its index). This helps in implementing quick transaction commits and paves the way for two phase commit support (and it is the reason for having optimizers).
The serializable transaction level operates the same as the read_committed transaction level, except that when the transaction is opened/started, a lock is acquired
on all the sub-indexes. This causes the transactional operations to be sequential in nature (as well as being a performance killer).
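The two locking strategies described above can be compared in a small model. The following is a sketch under stated assumptions, not the Compass::Core implementation: `ToyIndex` and `ToyTransaction` are hypothetical classes, an in-memory `ReentrantLock` stands in for Lucene's lock file, and timeouts and polling are left out. A read_committed style transaction locks a sub-index lazily on the first dirty operation, while a serializable style transaction locks every sub-index when it starts.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of sub-index locking; names and structure are assumptions.
final class ToyIndex {
    private final Map<String, String> aliasToSubIndex = new HashMap<>();
    // Sorted map, so "lock everything" always acquires locks in the same order.
    final Map<String, ReentrantLock> locks = new TreeMap<>();

    void map(String alias, String subIndex) {
        aliasToSubIndex.put(alias, subIndex);
        locks.computeIfAbsent(subIndex, k -> new ReentrantLock());
    }

    String subIndexOf(String alias) {
        String subIndex = aliasToSubIndex.get(alias);
        if (subIndex == null) {
            throw new IllegalArgumentException("unmapped alias: " + alias);
        }
        return subIndex;
    }
}

final class ToyTransaction {
    private final ToyIndex index;
    private final Set<String> held = new LinkedHashSet<>();

    // serializable == true locks every sub-index up front; otherwise nothing is locked yet.
    ToyTransaction(ToyIndex index, boolean serializable) {
        this.index = index;
        if (serializable) {
            for (String subIndex : index.locks.keySet()) {
                lock(subIndex);
            }
        }
    }

    // Read operations never obtain a lock.
    void read(String alias) {
        index.subIndexOf(alias);
    }

    // A dirty operation locks the sub-index of the alias (and so every alias mapped to it).
    void save(String alias) {
        lock(index.subIndexOf(alias));
    }

    Set<String> lockedSubIndexes() {
        return held;
    }

    // Releases every lock this transaction holds.
    void commit() {
        for (String subIndex : held) {
            index.locks.get(subIndex).unlock();
        }
        held.clear();
    }

    private void lock(String subIndex) {
        if (held.add(subIndex)) {
            index.locks.get(subIndex).lock();
        }
    }
}
```

With aliases "author" and "book" mapped to one sub-index and "article" to another, saving a book in the lazy transaction locks only the first sub-index, which also blocks dirty operations on authors. Acquiring all locks in a fixed order in the serializable case is a design choice of this sketch to avoid deadlocks between two such transactions.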
A special transaction level, batch_insert, utilizes the extremely fast batch indexing provided by Lucene. The transaction supports only the create operation. Note that if another Resource with the same alias and ids already exists in the index, you will end up with two instances of it (in other words, create doesn't delete the old Resource). You can control the batch_insert transaction using several settings, which are explained in the Configuration section. An important note is that this transaction cannot be rolled back: Lucene commits the changes during the batch indexing process, so a rollback operation won't roll back the changes. The index is optimized when the transaction is committed, which means that all the segments are merged into one segment in order to provide fast searching. The transaction is mainly used for background batch indexing.
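The duplicate-on-create behaviour is worth seeing in isolation. The sketch below is a toy in-memory model in plain Java, not Compass::Core code: `ToyStore` is a hypothetical class, and modelling save as "delete the entry with the same alias and id, then add" is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory index; each entry is {alias, id, value}.
final class ToyStore {
    private final List<String[]> entries = new ArrayList<>();

    // create: append only, never checks for an existing alias/id pair.
    void create(String alias, String id, String value) {
        entries.add(new String[] {alias, id, value});
    }

    // save (assumed semantics): remove any entry with the same alias and id, then append.
    void save(String alias, String id, String value) {
        entries.removeIf(e -> e[0].equals(alias) && e[1].equals(id));
        create(alias, id, value);
    }

    // Number of entries stored for the given alias and id.
    long count(String alias, String id) {
        return entries.stream()
                .filter(e -> e[0].equals(alias) && e[1].equals(id))
                .count();
    }
}

class BatchInsertDemo {
    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.create("author", "AUTHOR0812", "Jack London");
        store.create("author", "AUTHOR0812", "Jack London (revised)");
        System.out.println("after two creates: " + store.count("author", "AUTHOR0812"));
        store.save("author", "AUTHOR0812", "Jack London (final)");
        System.out.println("after save:        " + store.count("author", "AUTHOR0812"));
    }
}
```

A batch indexing job therefore has to make sure each alias and id pair is created only once, for example by indexing into an empty index.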
As mentioned in the read_committed section, every dirty transaction that is committed successfully creates another segment in the respective index. The more segments the index has, the slower fetch operations become. That's why it is important to keep the index optimized, with a controlled number of segments. We do this by merging small segments into larger segments.
In order to solve the problem, Compass::Core has a SearchEngineOptimizer which is responsible for keeping the number of segments at bay. When Compass is built using
CompassConfiguration, the SearchEngineOptimizer is started and when the Compass is closed, the SearchEngineOptimizer is
stopped.
Compass::Core provides support for scheduled optimizers. A scheduled optimizer uses a Java Timer to control its execution. The SearchEngineOptimizer starts and stops the timer when it starts and stops. There are several settings that can be set to control the scheduling.
Note: each optimizer that Compass provides can be scheduled.
The AggressiveOptimizer uses Lucene's optimization feature to optimize the index. Lucene optimization merges all the segments into one segment. You can set the limit on the number of segments, after which the index is considered to need optimization (the aggressive optimizer merge factor).
The AdaptiveOptimizer optimizes the segments while trying to keep the optimization time under control. As an example, when we have a large segment in our index (for example, after batch indexing the data) and we then perform several interactive transactions, the aggressive optimizer will merge all the segments together, while the adaptive optimizer will merge only the new small segments. You can set the limit on the number of segments, after which the index is considered to need optimization (the adaptive optimizer merge factor).
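The contrast between the two strategies can be sketched numerically. The code below is a simplified model and not the actual optimizer algorithms: `ToyOptimizers` is a hypothetical class, each segment is represented only by its document count (oldest first), and the rule "merge just enough of the newest segments to get back to the merge factor" is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: a segment is just its document count, oldest segment first.
final class ToyOptimizers {
    // Aggressive: once the segment count exceeds mergeFactor, merge everything into one.
    static List<Integer> aggressive(List<Integer> segments, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        if (segments.size() <= mergeFactor) {
            return new ArrayList<>(segments);
        }
        List<Integer> result = new ArrayList<>();
        result.add(sum(segments, 0));
        return result;
    }

    // Adaptive (assumed rule): merge only the newest segments, leaving older ones untouched.
    static List<Integer> adaptive(List<Integer> segments, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        if (segments.size() <= mergeFactor) {
            return new ArrayList<>(segments);
        }
        int keep = mergeFactor - 1; // number of oldest segments left as they are
        List<Integer> result = new ArrayList<>(segments.subList(0, keep));
        result.add(sum(segments, keep));
        return result;
    }

    private static void checkMergeFactor(int mergeFactor) {
        if (mergeFactor < 1) {
            throw new IllegalArgumentException("mergeFactor must be at least 1");
        }
    }

    // Total document count of the segments from index 'from' to the end.
    private static int sum(List<Integer> segments, int from) {
        int total = 0;
        for (int i = from; i < segments.size(); i++) {
            total += segments.get(i);
        }
        return total;
    }
}

class OptimizerDemo {
    public static void main(String[] args) {
        // One large batch-indexed segment followed by four small transactional ones.
        List<Integer> segments = Arrays.asList(100000, 10, 12, 8, 9);
        System.out.println("aggressive: " + ToyOptimizers.aggressive(segments, 2));
        System.out.println("adaptive:   " + ToyOptimizers.adaptive(segments, 2));
    }
}
```

In this model the aggressive strategy rewrites all 100039 documents into a single segment, while the adaptive one rewrites only the 39 documents in the small segments and leaves the large segment alone.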
Compass::Core also comes with a NullOptimizer, which performs no optimizations. It is mainly there for cases where the hosting application has developed its own optimization, maintained by means other than the SearchEngineOptimizer. It also makes sense to use it when configuring a Compass instance with a batch_insert transaction.
Note that when using the NullOptimizer it makes no sense to use the scheduling feature, so remember to set the compass.engine.optimizer.schedule to false.