Customisation

Lunr.py ships with some sensible defaults to create indexes and search easily, but in some cases you may want to tweak how documents are indexed and search. You can do that in lunr.py by passing your own Builder instance to the lunr function.

Pipeline functions

When the builder processes your documents it splits (tokenises) the text, and applies a series of functions to each token. These are called pipeline functions.

The builder includes two pipelines, indexing and searching.

If you want to change the way lunr.py indexes the documents you’ll need to change the indexing pipeline.

For example, say you wanted to support the American and British way of spelling certain words, you could use a normalisation pipeline function to force one token into the other:

from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline

documents = [...]

builder = get_default_builder()
def normalise_spelling(token, i, tokens) {
    if str(token) == "gray":
        return token.update(lambda: "grey")
    else:
        return token

lunr.pipeline.Pipeline.register_function(normalise_spelling)
builder.pipeline.add(normalise_spelling)

idx = lunr(ref="id", fields=("title", "body"), documents=documents, builder=builder)

Note pipeline functions take the token being processed, its position in the token list, and the token list itself.

Skip a pipeline function for specific field names

The Pipeline.skip() method allows you to skip a pipeline function for specific field names. It takes the function itself (not its name or its registered name) and the field name to skip as arguments. This example skips the stop_word_filter pipeline function for the field fullName.

from lunr import lunr, get_default_builder, stop_word_filter

documents = [...]

builder = get_default_builder()

builder.pipeline.skip(stop_word_filter.stop_word_filter, ["fullName"])

idx = lunr(ref="id", fields=("fullName", "body"), documents=documents, builder=builder)

Importantly, if you are using language support, the above code will not work, since there is a separate builder for each language, and the pipeline functions are generated by the code and so cannot be imported. Instead, you can access them by name. For instance to skip the stop word filter and stemmer for French for the field titre, you could do this:

from lunr import lunr, get_default_builder, stop_word_filter

documents = [...]

builder = get_default_builder("fr")

for funcname in "stopWordFilter-fr", "stemmer-fr":
    builder.pipeline.skip(
        builder.pipeline.registered_functions[funcname], ["titre"]
    )

idx = lunr(ref="id", fields=("titre", "texte"), documents=documents, builder=builder)

The current language support registers the functions lunr-multi-trimmer-{lang}, stopWordFilter-{lang} and stemmer-{lang} but these are by convention only. You can access the full list through the registered_functions attribute of the pipeline, but this is not necessarily the list of actual pipeline steps, which is contained in a private field (though you can see them in the string representation of the pipeline).

Token meta-data

Lunr.py Token instances include meta-data information which can be used in pipeline functions. This meta-data is not stored in the index by default, but it can be by adding it to the builder’s metadata_whitelist property. This will include the meta-data in the search results:

from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline

builder = get_default_builder()

def token_length(token, i, tokens):
    token.metadata["token_length"] = len(str(token))
    return token

Pipeline.register_function(token_length)
builder.pipeline.add(token_length)
builder.metadata_whitelist.append("token_length")

idx = lunr("id", ("title", "body"), documents, builder=builder)

[result, _, _] = idx.search("green")
assert result["match_data"].metadata["green"]["title"]["token_length"] == [5]
assert result["match_data"].metadata["green"]["body"]["token_length"] == [5, 5]

Similarity tuning

The algorithm used by Lunr to calculate similarity between a query and a document can be tuned using two parameters. Lunr ships with sensible defaults, and these can be adjusted to provide the best results for a given collection of documents.

  • b: This parameter controls the importance given to the length of a document and its fields. This value must be between 0 and 1, and by default it has a value of 0.75. Reducing this value reduces the effect of different length documents on a term’s importance to that document.

  • k1: This controls how quickly the boost given by a common word reaches saturation. Increasing it will slow down the rate of saturation and lower values result in quicker saturation. The default value is 1.2. If the collection of documents being indexed have high occurrences of words that are not covered by a stop word filter, these words can quickly dominate any similarity calculation. In these cases, this value can be reduced to get more balanced results.

These values can be changed in the builder:

from lunr import lunr, get_default_builder

builder = get_default_builder()
builder.k1(1.3)
builder.b(0)

idx = lunr("id", ("title", "body"), documents, builder=builder)