Build and register custom text analyzers from Tantivy builtins #425

Merged (2 commits) on Feb 9, 2025
114 changes: 114 additions & 0 deletions docs/tutorials.md
@@ -214,3 +214,117 @@


## Create a Custom Tokenizer (Text Analyzer)

Tantivy provides several built-in tokenizers and filters that
can be chained together to create new tokenizers (or
'text analyzers') that better fit your needs.

Tantivy-py lets you access these components, assemble them,
and register the result with an index.

Let's walk through creating and registering a custom text analyzer
to see how everything fits together.

### Example

First, let's create a text analyzer. As explained further down,
a text analyzer is a pipeline consisting of one tokenizer and
any number of token filters.

```python
from tantivy import (
TextAnalyzer,
TextAnalyzerBuilder,
Tokenizer,
Filter,
Index,
SchemaBuilder
)

my_analyzer: TextAnalyzer = (
TextAnalyzerBuilder(
# Create a `Tokenizer` instance.
# It instructs the builder about which type of tokenizer
# to create internally and with which arguments.
Tokenizer.regex(r"(?i)([a-z]+)")
)
.filter(
# Create a `Filter` instance.
# Like `Tokenizer`, this object provides instructions
# to the builder.
Filter.lowercase()
)
.filter(
        # Remove a custom list of stopwords.
Filter.custom_stopword(["www", "com"])
)
# Finally, build a TextAnalyzer
# chaining all tokenizer > [filter, ...] steps together.
.build()
)
```

We can check that our new analyzer is working as expected
by passing some text to its `.analyze()` method.

```python
# Returns: ['this', 'website', 'might', 'exist']
# "www" and "com" are dropped by the stopword filter; digits and
# dots never match the regex, so they delimit the tokens.
my_analyzer.analyze('www.this1website1might1exist.com')
```
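
The same builder pattern works with any of the exposed builtins. As a
rough sketch of an alternative pipeline -- assuming `Tokenizer.simple()`
and `Filter.stemmer()` are among the wrapped components, which this
section does not otherwise demonstrate -- a conventional English-text
analyzer might look like this:

```python
# Sketch only: Tokenizer.simple() and Filter.stemmer() are assumed
# to be exposed by tantivy-py alongside the builtins used above.
stemming_analyzer = (
    TextAnalyzerBuilder(Tokenizer.simple())
    .filter(Filter.lowercase())
    .filter(Filter.stemmer("english"))
    .build()
)

stemming_analyzer.analyze("Running runs")  # e.g. ['run', 'run']
```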

The next step is to register our analyzer with an index. Let's
assume we already have one.

```python
index.register_tokenizer("custom_analyzer", my_analyzer)
```

To link an analyzer to a field in the index, pass the
analyzer name to the `tokenizer_name=` parameter of
the `SchemaBuilder`'s `add_text_field()` method.

Here is the schema that was used to construct our index:

```python
schema = (
    SchemaBuilder()
.add_text_field("content", tokenizer_name="custom_analyzer")
.build()
)
index = Index(schema)
```

Summary:

1. Use `TextAnalyzerBuilder`, `Tokenizer`, and `Filter` to build a `TextAnalyzer`.
2. The analyzer's `.analyze()` method lets you use your analyzer as a tokenizer from Python.
3. Refer to your analyzer's name when building the index schema.
4. Use the same name when registering your analyzer on the index.
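
Putting these steps together, here is a minimal end-to-end sketch.
The writer and search calls are assumed from earlier sections of this
tutorial; the important detail is that the schema's `tokenizer_name=`
and the name passed to `register_tokenizer()` match.

```python
from tantivy import Document  # assumed available, as in earlier sections

schema = (
    SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)
index = Index(schema)

# Registration must use the same name the schema refers to.
index.register_tokenizer("custom_analyzer", my_analyzer)

writer = index.writer()
writer.add_document(Document(content="www.this1website1might1exist.com"))
writer.commit()
index.reload()

searcher = index.searcher()
query = index.parse_query("website", ["content"])
hits = searcher.search(query, 1).hits
```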


### On terminology: Tokenizer vs. Text Analyzer

Tantivy-py mimics Tantivy's interface as closely as possible.
This includes minor terminological inconsistencies, one of
which is the loose distinction Tantivy draws between 'tokenizers'
and 'text analyzers'.

Quite simply, a 'tokenizer' segments text into tokens.
A 'text analyzer' is a pipeline consisting of one tokenizer
and zero or more token filters. The `TextAnalyzer` is the
primary object of interest when talking about how to
change Tantivy's tokenization behavior.
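
One practical consequence: a `Tokenizer` object on its own is just a
set of instructions for the builder and cannot tokenize text from
Python; it must first be wrapped in a `TextAnalyzer`. A pipeline with
zero filters is still a valid analyzer -- a minimal sketch, again
assuming `Tokenizer.simple()` is exposed:

```python
# Zero filters: the resulting TextAnalyzer only segments text.
bare_analyzer = TextAnalyzerBuilder(Tokenizer.simple()).build()
bare_analyzer.analyze("Hello, World!")  # e.g. ['Hello', 'World']
```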

Slightly confusingly, though, the `Index` and `SchemaBuilder`
interfaces use 'tokenizer' to mean 'text analyzer'.

This inconsistency can be observed in `SchemaBuilder.add_text_field`, e.g. --

```
SchemaBuilder.add_text_field(..., tokenizer_name=<analyzer name>)
```

-- and in the name of the `Index.register_tokenizer(...)` method, which actually
serves to register a *text analyzer*.

10 changes: 10 additions & 0 deletions src/index.rs
@@ -12,6 +12,7 @@ use crate::{
schema::Schema,
searcher::Searcher,
to_pyerr,
tokenizer::TextAnalyzer as PyTextAnalyzer,
};
use tantivy as tv;
use tantivy::{
@@ -453,6 +454,15 @@ impl Index {

Ok((Query { inner: query }, errors))
}

/// Register a custom text analyzer by name. (Confusingly,
/// this is one of the places where Tantivy uses 'tokenizer' to refer to a
/// TextAnalyzer instance.)
///
// Implementation notes: Skipped indirection of TokenizerManager.
pub fn register_tokenizer(&self, name: &str, analyzer: PyTextAnalyzer) {
self.index.tokenizers().register(name, analyzer.analyzer);
}
}

impl Index {
6 changes: 6 additions & 0 deletions src/lib.rs
@@ -11,6 +11,7 @@ mod schema;
mod schemabuilder;
mod searcher;
mod snippet;
mod tokenizer;

use document::{extract_value, extract_value_for_type, Document};
use facet::Facet;
@@ -20,6 +21,7 @@ use schema::{FieldType, Schema};
use schemabuilder::SchemaBuilder;
use searcher::{DocAddress, Order, SearchResult, Searcher};
use snippet::{Snippet, SnippetGenerator};
use tokenizer::{Filter, TextAnalyzer, TextAnalyzerBuilder, Tokenizer};

/// Python bindings for the search engine library Tantivy.
///
@@ -87,6 +89,10 @@ fn tantivy(_py: Python, m: &Bound<PyModule>) -> PyResult<()> {
m.add_class::<SnippetGenerator>()?;
m.add_class::<Occur>()?;
m.add_class::<FieldType>()?;
m.add_class::<Tokenizer>()?;
m.add_class::<TextAnalyzerBuilder>()?;
m.add_class::<Filter>()?;
m.add_class::<TextAnalyzer>()?;

m.add_wrapped(wrap_pymodule!(query_parser_error))?;
