Skip to content

Updated pythonize and pyo3 #401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 5, 2025
Merged

Conversation

jamesbraza
Copy link
Contributor

Closes #371

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

@wallies wallies requested a review from Copilot December 17, 2024 11:10
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

@cjrh
Copy link
Collaborator

cjrh commented Jan 8, 2025

The reason the pickle test fails is because of something that is different in the pythonize package in the new version. When a Document object is deserialized, values in the json object that were i64, become u64 after deserialization:

Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

Look carefully at the types in the json object. Before serialization, I64(1), but after deserialization, U64. This seems to only happen with the JSON object. If that is commented (.add_json) in the test then the test passes.

@jamesbraza
Copy link
Contributor Author

Yeah I concur with your breakdown that the serialization/deserialization cycle is considering the JSON's a key to be unsigned now.

If that is commented (.add_json) in the test then the test passes.

Fwiw, going to assert orig.to_dict() == pickled.to_dict() leads to the test passing as well. So it's something within __eq__ specifically.


How are you getting that printed representation with I64/U64? Using repr or to_dict I don't see those values.

Also, Document has no __eq__ method, so I am not yet understanding how the __eq__ check is actually failing.

@cjrh
Copy link
Collaborator

cjrh commented Feb 17, 2025

If the assert .to_dict() change makes the test pass, I am ok to proceed with that. On the python side there is no difference between I64 and u64, nor in a json representation of those payloads if they ever get serialized for anything. What do you reckon?

@jamesbraza
Copy link
Contributor Author

I think it's important for __eq__ to work across serialization/deserialization.

What do you think of implementing a Document.__eq__ to invoke to_dict:

class Document:
    ...

    def __eq__(self, other) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.to_dict() == other.to_dict()

Also, it may be preferred to do this in document.rs with the Rust API, as opposed to in Python only.

@cjrh
Copy link
Collaborator

cjrh commented Feb 17, 2025

Overriding __eq__ makes me nervous.

I think it's important for eq to work across serialization/deserialization.

Good point. Perhaps we should make a reproducer of the issue using only pythonize, and ask from help from that project.

@jamesbraza
Copy link
Contributor Author

I would be happy to do so, if you can point out how you printed the Document such that U64/I64 stuff showed up.

Also, just so I can understand, how do you know the issue is with pythonize vs pyo3?

@cjrh
Copy link
Collaborator

cjrh commented Feb 18, 2025

if you can point out how you printed the Document such that U64/I64 stuff showed up.

From memory I put print statements somewhere. Hopefully I still have that code somewhere. I will check tomorrow.

@cjrh
Copy link
Collaborator

cjrh commented Feb 18, 2025

To print out that data, I added a print inside __richcmp__ like this:

impl Document {
    fn __richcmp__<'py>(
        &self,
        other: &Self,
        op: CompareOp,
        py: Python<'py>
    ) -> PyResult<Bound<'py, PyAny>> {
        println!("\n\n\nComparing:\n\n{:?}\n\n{:?}", self.field_values, other.field_values);
        match op {
            CompareOp::Eq => {
                let v = (self == other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            CompareOp::Ne => {
                let v = (self != other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            _ => {
                let v = PyNotImplemented::get(py).to_owned().into_any();
                Ok(v)
            }
        }
    }
}

Tracing backwards from here, I also added some prints in the _internal_from_pythonized() method:

impl Document {
    #[staticmethod]
    fn _internal_from_pythonized(serialized: &Bound<PyAny>) -> PyResult<Self> {
        println!("\n\n\nDeserializing: {:?}", serialized);
        let out = pythonize::depythonize(serialized).map_err(to_pyerr);
        let out: Document = out.unwrap();
        println!("\n\n\nDeserialized: {:?}", out);
        println!("\n\n\nDeserialized: {:?}", out.field_values);
        Ok(out)
    }
}

So when I run the tests I get this stdout:

------------------------------------- Captured stdout call -------------------------------------



Deserializing: {'birth': [{'Date': 1565614805000000000}], 'bytes': [{'Bytes': [97, 98, 99]}], 'facet': [{'Facet': '/europe/france'}], 'float': [{'F64': 1.0}], 'integer': [{'I64': 5}], 'json': [{'Object': {'a': 1, 'b': 2}}], 'title': [{'Str': 'hello world!'}], 'unsigned': [{'U64': 1}]}



Deserialized: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Deserialized: {"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
orig: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
pickled: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

@jamesbraza
Copy link
Contributor Author

Okay I made jamesbraza#1, which:

  1. PR description states my current understanding of the problem (newer pythonize more aggressively making U64)
  2. Poses a somewhat hacky solution, basically checking if one can convert back to I64

I am not happy with this, but I am really in over my head here, so I made davidhewitt/pythonize#80 basically as a SOS.

@jamesbraza
Copy link
Contributor Author

Perhaps extract_value at https://github.com/quickwit-oss/tantivy-py/blob/0.22.0/src/document.rs#L37-L39 has something to do with it? I don't see a U64 case, more just talking out loud

@cjrh
Copy link
Collaborator

cjrh commented Mar 3, 2025

@jamesbraza It would be great if you could add a very short reproducer to your issue on pythonize, so that David can easily run it and see the change in type during roundtrip.

@cjrh
Copy link
Collaborator

cjrh commented Mar 3, 2025

and

but I am really in over my head here

Don't worry, we're all learning all the time. Thanks for your continued interest in helping here ❤️

@cjrh
Copy link
Collaborator

cjrh commented May 4, 2025

I pushed a commit with a new strategy: for the JSON type, deserialize all integers to I64. This solves our problem with the pickle test, and without breaking any handling for toplevel I64/U64 integers. This is probably the best we can do here.

I will also try to update this branch on top of current master.

@cjrh
Copy link
Collaborator

cjrh commented May 4, 2025

Successfully merged the main branch in.

Custom handling for JSON objects, always I64 for ints

Updated for upstream/master (tantivy 0.24.0)

Fix formatting
@jamesbraza
Copy link
Contributor Author

My apologies for the two month hiatus here, good to see you guys pushed this forward.

To catch myself up and concrete my understanding, just giving a tl;dr on this PR's story:

  1. With newer pythonize, we found I64 objects after deserialization become U64, breaking Document.__eq__.
  2. This happens is because newer pythonize::depythonize prioritizes deserialization to unsigned ints over signed ints, per fix large integer handling davidhewitt/pythonize#69.
  3. I posed Moves Document.__richcmp__ to convert U64 back to I64 jamesbraza/tantivy-py#1, which involves expanding Document.__richcmp__ to recursively cast unsigned integers to signed integers.
  4. In Request: normalization and depythonize davidhewitt/pythonize#80, David poses adding a config option for depythonize to prefer signed ints over unsigned ints.
  5. Caleb just added a solution of, upon deserialization of Object, recursively casting numbers to I64 (if possible), otherwise F64.

In practice this change is backwards compatible ❤️ as far as I am concerned. The only incompatibility is round trip ser/de's an old Document will convert any unsigned ints to signed ints in JSON, but that doesn't matter. I am down to ship this 🚢 .


In other news, it looks like pythonize::depythonize_bound was just removed in pythonize version 0.24 in davidhewitt/pythonize#83, so I just updated the code for that. This brings cargo build to pass.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the pythonize usage and pyo3 API calls, addressing dependency upgrades and refining JSON deserialization for documents. Key changes include:

  • Replacing pythonize::depythonize_bound with pythonize::depythonize in multiple modules.
  • Renaming and refactoring extraction functions to improve API clarity.
  • Introducing custom JSON deserialization functions for converting numeric values.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File Description
src/schema.rs Updated pythonize usage to remove unnecessary cloning.
src/query.rs Refactored tuple extraction with a renamed function for clarity.
src/document.rs Revised pythonize conversion and added custom JSON deserialization.
Cargo.toml Upgraded dependency versions for pythonize and pyo3.
Comments suppressed due to low confidence (2)

src/query.rs:19

  • [nitpick] The function renaming to 'extract_bound' and the destructuring of the tuple improves clarity; ensure that this naming aligns with the rest of the codebase conventions to avoid confusion.
let (occur, query): (Occur, Query) = ob.extract()?

src/schema.rs:66

  • The removal of the clone indicates better performance, but please verify that passing a reference to depythonize correctly handles the Bound without unintended side effects.
pythonize::depythonize(&serialized).map_err(to_pyerr)

@cjrh cjrh merged commit fb27cf2 into quickwit-oss:master May 5, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to fetch wheel: tantivy==0.22.0 (Python 3.13.0)
3 participants