Build and register custom text analyzers from Tantivy builtins #425

a3lem · 2025-02-08T16:43:47Z

This PR adds the ability to build new text analyzers using Tantivy's builtin tokenizers and filters, as listed in the tokenizer module docs (https://docs.rs/tantivy/latest/tantivy/tokenizer/index.html). New analyzers are registered with an index by name. The registered name links analyzers to the tokenizer name specified by text fields in the index's schema.

Here is an example:

import tantivy

custom_analyzer = (
    tantivy.TextAnalyzerBuilder(tokenizer=tantivy.Tokenizer.whitespace())
    .filter(tantivy.Filter.lowercase())
    .build()
)

doc_text = "#03 8903 HELLO"
assert ["#03", "8903", "hello"] == custom_analyzer.analyze(doc_text)

schema = (
    tantivy.SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)

index = tantivy.Index(schema)
index.register_tokenizer("custom_analyzer", custom_analyzer)

Building a text analyzer happens through an interface that (more or less) follows Tantivy's design. I had to deviate from it in some places, usually because of some mismatch between Rust's way of doing things and what PyO3 permits.

Differences:

Tokenizer.simple(), Tokenizer.raw(), etc. instead of Tantivy's SimpleTokenizer, RawTokenizer, etc.
- Why: PyO3 does not permit traits in a pyclass's constructor. Mirroring Tantivy's design would also mean having to define a Tokenizer base class in Python. I tried all sorts of indirection but couldn't get anything to work. In the end, an enum seemed the simplest way to proceed.
Likewise, Filter.lowercase() instead of Tantivy's LowerCaser.
- Why: same reason as the previous point.
index.register_tokenizer() instead of Tantivy's index.tokenizers().register(). The extra indirection didn't seem worth it to me.
Just as tantivy-py instantiates SchemaBuilder directly instead of via Schema.builder() (the Tantivy way), my design instantiates TextAnalyzerBuilder directly instead of via TextAnalyzer.builder().
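
To make the shape of this interface concrete, here is a minimal pure-Python sketch of the same design: one concrete Tokenizer class and one Filter class, each with named constructors, plus a directly instantiated builder. All classes below are hypothetical stand-ins written for illustration. They are not the PR's implementation, which lives in Rust behind PyO3, and only the whitespace, raw, and lowercase variants are modelled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Tokenizer:
    """Stand-in: a single class with named constructors, like the enum approach."""
    name: str
    split: Callable[[str], List[str]]

    @staticmethod
    def whitespace() -> "Tokenizer":
        # Split on runs of whitespace.
        return Tokenizer("whitespace", lambda text: text.split())

    @staticmethod
    def raw() -> "Tokenizer":
        # Emit the whole input as one token.
        return Tokenizer("raw", lambda text: [text])


@dataclass(frozen=True)
class Filter:
    """Stand-in: token filters are also created through named constructors."""
    name: str
    apply: Callable[[List[str]], List[str]]

    @staticmethod
    def lowercase() -> "Filter":
        return Filter("lowercase", lambda tokens: [t.lower() for t in tokens])


class TextAnalyzer:
    """Runs the tokenizer, then each filter in the order it was added."""

    def __init__(self, tokenizer: Tokenizer, filters: List[Filter]) -> None:
        self._tokenizer = tokenizer
        self._filters = list(filters)

    def analyze(self, text: str) -> List[str]:
        tokens = self._tokenizer.split(text)
        for token_filter in self._filters:
            tokens = token_filter.apply(tokens)
        return tokens


class TextAnalyzerBuilder:
    """Instantiated directly, rather than through TextAnalyzer.builder()."""

    def __init__(self, tokenizer: Tokenizer) -> None:
        self._tokenizer = tokenizer
        self._filters: List[Filter] = []

    def filter(self, token_filter: Filter) -> "TextAnalyzerBuilder":
        self._filters.append(token_filter)
        return self  # returning self allows chained calls

    def build(self) -> TextAnalyzer:
        return TextAnalyzer(self._tokenizer, self._filters)


analyzer = (
    TextAnalyzerBuilder(tokenizer=Tokenizer.whitespace())
    .filter(Filter.lowercase())
    .build()
)
print(analyzer.analyze("#03 8903 HELLO"))  # ['#03', '8903', 'hello']
```

Because every variant is an instance of the same class, the builder only ever has to accept one concrete type, which is the property that made the enum approach workable on the PyO3 side.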

Extras:

Besides creating and registering text analyzers, you can also call an analyzer from Python to tokenize text. This makes it easier to build queries manually without having to reimplement the tokenization process in Python.
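
As a rough illustration of that workflow, the sketch below turns user input into (field, term) pairs using nothing but an object that exposes analyze(text) and returns a list of strings, as custom_analyzer does in the example above. The helper query_terms and the stub analyzer are hypothetical names introduced here so the snippet runs without a tantivy-py build that includes this PR.

```python
from typing import List, Protocol, Tuple


class Analyzer(Protocol):
    """Anything with an analyze(text) method that returns a list of tokens."""

    def analyze(self, text: str) -> List[str]: ...


def query_terms(analyzer: Analyzer, field: str, text: str) -> List[Tuple[str, str]]:
    """Return unique (field, term) pairs in first-seen order."""
    seen = set()
    pairs: List[Tuple[str, str]] = []
    for term in analyzer.analyze(text):
        if term not in seen:
            seen.add(term)
            pairs.append((field, term))
    return pairs


class LowercaseWhitespaceStub:
    """Hypothetical stand-in that mimics a whitespace + lowercase analyzer."""

    def analyze(self, text: str) -> List[str]:
        return text.lower().split()


pairs = query_terms(LowercaseWhitespaceStub(), "content", "HELLO hello #03")
print(pairs)  # [('content', 'hello'), ('content', '#03')]
```

With a real analyzer, each pair could then be turned into a term query and the results combined into a boolean query. The point is that the query side sees exactly the tokens that were indexed.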

State of this PR:

Tests pass. Granted, I haven't written tests for every combination of tokenizer and filter, but so far I'm satisfied that it works. I'm happy to extend the set of tests though.
I've added a tutorial section to the docs.
I've updated the .pyi and added doc comments to the Rust code.

Next steps:

Disclaimer: I'm new to Rust. I got some help from Claude and Deepseek but neither is amazing at Rust. My changes haven't broken any tests, but still... I think you're going to want to review this contribution with a critical eye.

Random:

While working on this, there was a PyO3 feature available in new releases that I wanted to use: enum constructor signatures. Upgrading PyO3 unleashed a bunch of compilation errors, so it seems for now we're stuck on the currently pinned version.

cjrh · 2025-02-08T21:08:23Z

This looks good, thank you. Let's get the tests fixed up and I'll merge it 👍🏼

a3lem · 2025-02-09T13:37:56Z

Awesome, thanks for looking at this so soon!

There was a doctest failing. (I forgot to use nox, whoops!) It should be fixed now.

When I run nox locally (on macOS, arm64), two of the Python versions fail:

3.10: some uv/pyenv/path-related issue specific to my system.
3.13: "the configured Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)"

Let's rerun the tests

Code, tests, docs complete

9e59a60

Reformat unintended doctest in docs

0de1287

cjrh approved these changes Feb 9, 2025


cjrh merged commit 5071a6f into quickwit-oss:master Feb 9, 2025
10 checks passed
