Build and register custom text analyzers from Tantivy builtins #425

Merged
merged 2 commits into from
Feb 9, 2025

a3lem
Contributor

@a3lem a3lem commented Feb 8, 2025

This PR adds the ability to build new text analyzers using Tantivy's builtin tokenizers and filters, as listed in the tokenizer module docs (https://docs.rs/tantivy/latest/tantivy/tokenizer/index.html). New analyzers are registered with an index by name. The registered name links analyzers to the tokenizer name specified by text fields in the index's schema.

Here is an example:

import tantivy

custom_analyzer = (
    tantivy.TextAnalyzerBuilder(tokenizer=tantivy.Tokenizer.whitespace())
    .filter(tantivy.Filter.lowercase())
    .build()
)

doc_text = "#03 8903 HELLO"
assert ["#03", "8903", "hello"] == custom_analyzer.analyze(doc_text)

schema = (
    tantivy.SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)

index = tantivy.Index(schema)
index.register_tokenizer("custom_analyzer", custom_analyzer)

Building a text analyzer happens through an interface that (more or less) follows Tantivy's design. I had to deviate from it in some places, usually because of some mismatch between Rust's way of doing things and what PyO3 permits.

Differences:

  • Tokenizer.simple(), Tokenizer.raw(), etc... instead of Tantivy's SimpleTokenizer, RawTokenizer, etc.
    • Why: PyO3 does not permit traits in a pyclass's constructor. It would also mean having to define a Tokenizer base class in Python. I tried all sorts of indirection, but couldn't get any of it to work. In the end, an enum seemed the simplest way to proceed.
  • Likewise, Filter.lowercase() instead of Tantivy's LowerCaser. Why: same reason as previous point.
  • index.register_tokenizer() instead of Tantivy's index.tokenizers().register(). The extra indirection didn't seem worth it to me.
  • Just as tantivy-py instantiates SchemaBuilder directly instead of as Schema.builder() (the Tantivy way), my design instantiates TextAnalyzerBuilder directly, instead of as TextAnalyzer.builder().

Extras:

  • Besides being able to create and register text analyzers, you can also use the analyzer from Python to tokenize text. This makes it easier to build queries manually without having to reimplement the tokenization process in Python.
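
To illustrate the point about building queries manually, here is a minimal sketch. It only relies on the analyzer exposing analyze(text) and returning a list of strings, as in the example above. The WhitespaceLowercaseStandIn class and the terms_for_field helper are hypothetical names made up for this sketch so that it runs without tantivy installed; in practice you would pass the custom_analyzer built with TextAnalyzerBuilder instead.

```python
from typing import Protocol


class Analyzer(Protocol):
    """Anything with analyze(text) -> list[str], like the analyzer in this PR."""

    def analyze(self, text: str) -> list[str]: ...


class WhitespaceLowercaseStandIn:
    """Hypothetical pure-Python stand-in for whitespace tokenizer + lowercase filter."""

    def analyze(self, text: str) -> list[str]:
        # Split on whitespace, then lowercase each token.
        return [token.lower() for token in text.split()]


def terms_for_field(analyzer: Analyzer, field: str, text: str) -> list[tuple[str, str]]:
    """Tokenize text and pair each token with the field it should match."""
    return [(field, token) for token in analyzer.analyze(text)]


pairs = terms_for_field(WhitespaceLowercaseStandIn(), "content", "#03 8903 HELLO")
print(pairs)
```

Each (field, token) pair could then be turned into a term-level query against the index. The exact query-construction call is left out here because it depends on the tantivy-py version in use.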

State of this PR:

  • Tests pass. Granted, I haven't written tests for every combination of tokenizer and filter, but so far I'm satisfied that it works. I'm happy to extend the set of tests though.
  • I've added a tutorial section to the docs.
  • I've updated the .pyi and added doc comments to the Rust code.

Next steps:

  • Disclaimer: I'm new to Rust. I got some help from Claude and Deepseek but neither is amazing at Rust. My changes haven't broken any tests, but still... I think you're going to want to review this contribution with a critical eye.

Random:

  • While working on this, I wanted to use a PyO3 feature that is available in newer releases: enum constructor signatures. Upgrading PyO3 caused a bunch of compilation errors, so it seems for now we're stuck on the currently pinned version.

@cjrh
Collaborator

cjrh commented Feb 8, 2025

This looks good, thank you. Let's get the tests fixed up and I'll merge it 👍🏼

@a3lem
Contributor Author

a3lem commented Feb 9, 2025

Awesome, thanks for looking at this so soon!

There was a doctest failing. (I forgot to use nox, whoops!) It should be fixed now.

When I run nox locally (on macOS, arm64), two of the Python versions fail:

  • 3.10: some uv/pyenv/path-related issue specific to my system.
  • 3.13: "the configured Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)"

Let's rerun the tests

@cjrh cjrh merged commit 5071a6f into quickwit-oss:master Feb 9, 2025
10 checks passed