[VARIANT] Add support for the json_to_variant API #7783


Merged
merged 43 commits into apache:main on Jul 3, 2025

Conversation

@harshmotw-db (Contributor) commented Jun 25, 2025

Which issue does this PR close?

Rationale for this change

Explained in the issue.

What changes are included in this PR?

This PR includes a json_to_variant API to parse JSON strings as Variants. json_to_variant takes the input JSON string and a VariantBuilder as arguments and builds the variant using the builder. The resulting variant can be extracted with builder.finish(), which consumes the builder and returns the Variant buffers.
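As a rough illustration of that flow, here is a minimal sketch assuming the signatures described above (the exact API surface may differ):

```rust
use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBuilder};

// Parse a JSON string and return the (metadata, value) Variant buffers.
fn parse(json: &str) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
    let mut builder = VariantBuilder::new();
    json_to_variant(json, &mut builder)?;
    // finish() consumes the builder and hands back the buffers.
    Ok(builder.finish())
}
```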

Are these changes tested?

Unit tests and an example file.

Are there any user-facing changes?

Yes, the PR introduces the json_to_variant API.

@github-actions bot added the `parquet` (Changes to the parquet crate) label Jun 25, 2025
@harshmotw-db (Contributor Author):

cc @scovich

@alamb (Contributor) commented Jun 25, 2025

FYI @carpecodeum

This is amazing @harshmotw-db -- I was just thinking about this.

I started the CI checks and will check this PR more carefully tomorrow

@harshmotw-db changed the title from "add json_to_variant" to "[VARIANT] Add support for the json_to_variant API" on Jun 25, 2025
serde_json = "1.0"
# "arbitrary_precision" allows us to manually parse numbers. "preserve_order" does not automatically
# sort object keys
serde_json = { version = "1.0", features = ["arbitrary_precision", "preserve_order"] }
Contributor:

TIL about the (undocumented) arbitrary_precision feature flag and (undocumented) Number::as_str method it unlocks 🤯

That could solve a lot of problems for variant decimals on the to_json path as well:
https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/to_json.rs#L147-L162
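As a small, hedged illustration of that feature (assuming serde_json is built with arbitrary_precision):

```rust
fn raw_number_text() {
    // With "arbitrary_precision", serde_json keeps the number's original
    // literal text, so Number::as_str avoids a lossy f64 round-trip.
    let v: serde_json::Value =
        serde_json::from_str("123456789.000000000000000000001").unwrap();
    if let serde_json::Value::Number(n) = v {
        assert_eq!(n.as_str(), "123456789.000000000000000000001");
    }
}
```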

Contributor:

Question tho -- do we need to support both modes of serde_json, given that the user (not arrow-rs) probably decides whether to use that feature flag?

Contributor Author:

It's possible some projects might prioritize performance and might not want the Decimal type at all. I would prefer doing that as a follow up though. If you agree, I'll go ahead and create an issue.

Contributor:

Oh, for sure it's a follow-up. But we probably do need a story for how to handle that feature flag in a dependency (maybe part of the bigger story of how to handle a serde_json dependency in the first place).


Comment on lines 1040 to 1042
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
Contributor:

Suggested change
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
if let Some(i) = n.as_i64() {
// Find minimum Integer width to fit

Comment on lines 1043 to 1051
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
Contributor:

Interesting approach. That double .. as .. as .. definitely caused a double take, but I guess it could be cheaper than TryFrom?

Suggested change
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
let value = i8::try_from(i)
.map(Variant::from)
.or_else(|_| i16::try_from(i).map(Variant::from))
.or_else(|_| i32::try_from(i).map(Variant::from))
    .unwrap_or_else(|_| Variant::from(i));
Ok(value)

Comment on lines 36 to 38
pub struct SampleBoxBasedVariantBufferManager {
pub value_buffer: Box<[u8]>,
pub metadata_buffer: Box<[u8]>,
Contributor:

What purpose would this serve? It seems strictly inferior to a vec for both normal and testing uses?

Contributor:

Continuing the above thought about building up an arrow Array... maybe a SampleMultiVariantBufferManager that can stack a bunch of variants end to end, with a len method that allows to extract offsets?
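A hedged sketch of that idea (all names here are hypothetical, not part of this PR):

```rust
// Hypothetical: stack finished variants end to end and record offsets, so the
// buffers could later back arrow binary arrays for metadata/value columns.
pub struct SampleMultiVariantBufferManager {
    value_buffer: Vec<u8>,
    metadata_buffer: Vec<u8>,
    // End offset of each appended variant; usable as an arrow offsets array.
    value_offsets: Vec<usize>,
    metadata_offsets: Vec<usize>,
}

impl SampleMultiVariantBufferManager {
    /// Number of variants appended so far.
    pub fn len(&self) -> usize {
        self.value_offsets.len()
    }

    /// Record the end of the variant currently at the tail of the buffers.
    pub fn finish_one(&mut self) {
        self.value_offsets.push(self.value_buffer.len());
        self.metadata_offsets.push(self.metadata_buffer.len());
    }
}
```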

@harshmotw-db (Contributor Author) commented Jun 26, 2025:

Yes, I do plan on adding an arrow-compliant sample buffer manager. I don't fully understand arrow yet, but I believe the trait is simple enough to extend to such use cases. I will do that as a follow-up if you agree with the general design.

@@ -0,0 +1,124 @@
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
Contributor:

I'm not quite sure I understand this trait.

It almost seems like a good building block for eventual columnar writes, if an implementation took a mutable reference to a vec, where each VariantBufferManager you build with it is only capable of appending new bytes to the end (leaving existing bytes unchanged). That way, one could build up an offset array along the way as part of building up arrow binary arrays for the metadata and value columns?

But I don't understand the whole "may be called several times" part, which seems to return the whole slice on every call? I guess that's because the variant builder doesn't do its work all at once, but still expects to receive the overall slice?

Finally, why split the ensure and borrow methods, when both take &mut self and it's almost certainly dangerous to "borrow" without an "ensure" first (can't guess the size in advance, it's variant data)? Why not just a pair of ensure_and_borrow_XXX_buffer methods that return a slice of exactly the requested size (no larger, no smaller)?

Contributor Author:

I think it's reasonable to return the whole slice, and have the Variant builder choose to write data wherever it chooses to. The trait was originally created for this PR where bytes are often being shifted to make way for headers etc. Let me know if you disagree or would recommend a different simple alternative.

As for why the methods are separate, I think you are generally correct and they could be combined. In the original PR, I was calling ensure_value_buffer_size only occasionally, because I was performing size checks in the variant library before calling the method; that check could be moved into the method.

Contributor Author:

I have consolidated the two methods.
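For readers following along, the consolidated shape is roughly this (a sketch; the exact trait in the PR may differ):

```rust
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
    /// Ensure the value buffer holds at least `size` bytes, then return it.
    fn ensure_and_borrow_value_buffer(&mut self, size: usize)
        -> Result<&mut [u8], ArrowError>;

    /// Same contract for the metadata buffer.
    fn ensure_and_borrow_metadata_buffer(&mut self, size: usize)
        -> Result<&mut [u8], ArrowError>;
}
```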

@alamb (Contributor) commented Jun 26, 2025:

Given that the VariantBuilders currently need to write into a memory buffer anyway, I wonder what this trait is trying to abstract.

Like maybe the json_to_variant should take a VariantBuilder to write to and not try to manage buffers itself 🤔

@harshmotw-db (Contributor Author) commented Jun 26, 2025:

Oh, my opinion is that the variant builders should write to buffers owned by the caller. So no more ValueBuffer and MetadataBuffer - just VariantBufferManager.
delta-io/delta-kernel-rs#1034
This PR just lays the groundwork for that.

Contributor:

I think having VariantBuilder write to buffers managed elsewhere is a good idea, and it is useful more generally than json_to_variant.

Perhaps we can make the json_to_variant API be in terms of the VariantBuilder and then modify the VariantBuilder in a separate PR to handle buffers owned by the caller.

Contributor:

I filed a ticket to track the idea of writing to buffers owned by the caller here

@harshmotw-db (Contributor Author) commented Jun 27, 2025:

Wouldn't it be better if the VariantBuilder dependency was abstracted away? So, right now the functionality looks like: (1) here's the JSON string, (2) Here are my buffers (variant_buffer_manager), write the variant to the buffers.

The alternative is: (1) Here's the JSON string, (2) Here are my buffers, (3) this is the VariantBuilder built using my buffers (which I don't know why you need) => write the variants to these buffers.

I think the current VariantBufferManager approach should be reasonably extensible to every use case, and it also supports resizing on the fly. The example solution described in this ticket would likely be retry-based, which would have a higher performance overhead.

Contributor Author:

Actually, yeah, in the short run I think it is good not to depend on VariantBufferManager, since there are unanswered questions. I will implement the VariantBuilder-based API today.

@harshmotw-db (Contributor Author) commented Jun 27, 2025:

@alamb I have removed the dependency of the current PR on custom buffer management and implemented what you suggested.

use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBufferManager};

pub struct SampleVariantBufferManager {
Contributor:

How is this different from the box-based one above, and why not just use the vec-based one above instead?

Contributor Author:

Good catch! I forgot to replace this

@alamb (Contributor) commented Jun 26, 2025

Here is some parallel art from @zeroshade in the go implementation:

@alamb (Contributor) left a comment:

Thanks again @harshmotw-db -- I took a quick look -- this is a great start. I'll look more carefully tomorrow

impl TryFrom<&Number> for Variant<'_, '_> {
type Error = ArrowError;

fn try_from(n: &Number) -> Result<Self, Self::Error> {
Contributor:

follow up sounds like a good plan to me

@@ -32,6 +32,8 @@ mod decimal;
mod list;
mod metadata;
mod object;
use rust_decimal::prelude::*;
use serde_json::Number;
Contributor:

I think we should eventually plan to avoid all use of serde_json for parsing / serializing.

It is fine to use serde_json for the initial version / getting things functionally working, but it is terribly inefficient compared to more optimized implementations like the tape decoder.

Thus, can we please try to keep anything serde_json-specific out of Variant (like this From impl)?

Contributor Author:

Oh, I'll move much of this code from variant.rs to from_json.rs if that's what you meant.

Contributor Author:

Done

@harshmotw-db requested review from alamb and scovich, June 26, 2025 08:10
@scovich (Contributor) left a comment:

LGTM. A few small cleanups to consider before merge.

Comment on lines 52 to 53
let new_len = size.next_power_of_two();
self.value_buffer.resize(new_len, 0);
Contributor:

I'm not sure we need to do this -- the underlying vec is guaranteed to have reasonable amortized allocation costs:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. ... Whatever strategy is used will of course guarantee O(1) amortized push.

Contributor Author:

I did this for demonstration purposes, so that custom non-vec implementations allocate reasonably sized buffers and don't encounter O(n^2) complexity. Maybe we should request power-of-two sizes from the library itself when we integrate this deeper into construction.
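The strategy under discussion, as a sketch for a custom implementation (illustrative only):

```rust
// Round requested sizes up to the next power of two so that a sequence of
// small growth requests costs amortized O(1) per byte rather than O(n^2)
// total from repeated exact-size reallocations.
fn ensure_capacity(buf: &mut Vec<u8>, size: usize) {
    if size > buf.len() {
        buf.resize(size.next_power_of_two(), 0);
    }
}
```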

Comment on lines 108 to 115
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
for val in arr {
append_json(val, builder)?;
}
Ok(())
}

fn build_object<'a, 'b>(
Contributor:

These two functions only have one caller each now -- should we just fold their 3-4 lines of code into their call sites?

Contributor Author:

I have changed it to try_fold. Lmk if you meant something else.
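i.e. roughly this shape, reusing the types from the snippet above (illustrative):

```rust
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
    arr.iter().try_fold((), |_, val| append_json(val, builder))
}
```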

Contributor:

Oh sorry, I just meant to get rid of the functions and move their code directly where the function used to be called from.

Contributor Author:

Done

@alamb (Contributor) commented Jul 1, 2025

Seems reasonable. And as a quick-follow, it's not that hard to manually parse a JSON numeric string literal to Variant (pathfinding already done). The grammar is super simple, basically:

For sure

Also, we have some prior art in these crates here:

https://github.com/apache/arrow-rs/blob/5505113d9745aba2cb46df2fd11a4b3d9672d5d2/arrow-cast/src/parse.rs#L854-L853

Those require knowing the precision up front, but we could perhaps adapt the code.

@harshmotw-db (Contributor Author):

@alamb @scovich Yeah, I think that makes sense. For now, we could do away with the decimal type in json_to_variant and always use double, and soon replace it when the custom parser is ready.

That being said, the preserve_order flag ensures that object keys are inserted into the Variant in the order that they appear. Removing it would result in a behavior difference between Arrow and Spark's parse_json expression. I don't think that makes a logical difference but the binaries would look different as the keys would be sorted in the Arrow library.

Just to clarify, the number parsing was never a hard problem; it only overcomes the rust_decimal dependency. What needs to be done away with is that serde_json doesn't expose the raw number strings without the arbitrary_precision flag. But I guess the tape decoder PR implements all of JSON parsing internally, so that's great.

@alamb (Contributor) commented Jul 1, 2025

That being said, the preserve_order flag ensures that object keys are inserted into the Variant in the order that they appear. Removing it would result in a behavior difference between Arrow and Spark's parse_json expression. I don't think that makes a logical difference but the binaries would look different as the keys would be sorted in the Arrow library.

I think this is also a property of the tape decoder -- since it doesn't parse into a tree / hash map structure, it will present the fields in the order they appear in the JSON text.

@harshmotw-db requested a review from alamb, July 2, 2025 21:30
@harshmotw-db (Contributor Author) commented Jul 2, 2025

@alamb I have now removed all dependencies on the custom serde_json features and ignored all decimal tests. See if the PR can be merged now.

@scovich (Contributor) commented Jul 2, 2025

Just to clarify, the number parsing was never a hard problem... But I guess the tape decoder PR implements all of JSON parsing internally so that's great.

The tape decoder presents JSON numeric values as strings. But they still need to be parsed. Meanwhile, I had a little too much fun playing with parsing code (playground).

It's a lot more efficient than the "try it and see" approach my pathfinding PR took. Probably a little too fancy tho (code size could cause instruction cache and branch prediction problems that actually hurt performance).
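For a flavor of how simple that grammar is, a minimal hedged sketch (no exponent handling, illustrative only):

```rust
// Parse a JSON decimal literal into (unscaled value, scale), e.g.
// parse_decimal("-12.345") == Some((-12345, 3)).
fn parse_decimal(s: &str) -> Option<(i128, u32)> {
    let (neg, digits) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_part, frac_part) = digits.split_once('.').unwrap_or((digits, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    let mut unscaled: i128 = 0;
    for b in int_part.bytes().chain(frac_part.bytes()) {
        if !b.is_ascii_digit() {
            return None;
        }
        unscaled = unscaled.checked_mul(10)?.checked_add(i128::from(b - b'0'))?;
    }
    Some((if neg { -unscaled } else { unscaled }, frac_part.len() as u32))
}
```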

@scovich (Contributor) left a comment:

LGTM, but please make the lifetimes more readable?

@@ -249,27 +245,6 @@ pub fn variant_to_json_string(variant: &Variant) -> Result<String, ArrowError> {
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {e}")))
}

fn variant_decimal_to_json_value(decimal: &impl VariantDecimal) -> Result<Value, ArrowError> {
Contributor:

Seems like a macro could avoid both overhead and duplication? The only part that needs to change (besides the data types) is Decimal16 has a more complicated way of handling the integer value produced at the end.

@@ -254,7 +269,7 @@ fn test_json_to_variant_decimal16_max_scale() -> Result<(), ArrowError> {
fn test_json_to_variant_double_precision() -> Result<(), ArrowError> {
JsonToVariantTest {
json: "0.79228162514264337593543950335",
expected: Variant::Double(0.792_281_625_142_643_3_f64),
expected: Variant::Double(0.792_281_625_142_643_4_f64),
Contributor:

what made this one change in the latest commit, out of curiosity?

Contributor Author:

I didn't investigate too much, but I suppose the arbitrary_precision flag was causing slightly different results in parsing doubles. I didn't care because doubles lose precision anyway.

Comment on lines 132 to 134
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
Contributor:

This is hard to interpret... can we use 'm and 'v?

Suggested change
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
struct ObjectFieldBuilder<'m, 'v, 'r> {
key: &'r str,
builder: &'r mut ObjectBuilder<'m, 'v>,

(here, 'r is the lifetime of the references used to construct the field builder)

Contributor Author:

I've changed it to 's, 'o, and 'v, where 's is the lifetime of the string, 'o is the lifetime of the ObjectBuilder, and 'v is the lifetime of the variant buffers.
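i.e. roughly this (a sketch reconstructed from the description above):

```rust
// 's: the input string, 'o: the mutable borrow of the ObjectBuilder,
// 'v: the variant buffers.
struct ObjectFieldBuilder<'s, 'o, 'v> {
    key: &'s str,
    builder: &'o mut ObjectBuilder<'v, 's>,
}
```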

@alamb (Contributor) commented Jul 3, 2025

I merged up from main and fixed a clippy error to get a clean CI run

Given how long this PR has been outstanding and that it blocks a bunch of other work, I think we should try to merge it in and file follow-ons ASAP. I am giving it another review now.

}

#[test]
fn test_json_to_variant_object_very_large() -> Result<(), ArrowError> {
Contributor:

this single test currently takes over 22 seconds to complete


I will see if I can make it faster

Contributor:

I wasn't able to make it faster with a small effort, but I did briefly look at some profiling and it is spending a very large amount of time validating the offsets in the variant

That is likely something we can improve on over time

@alamb (Contributor) left a comment:

I agree with @scovich this PR is ready to go -- thank you for the work in this @harshmotw-db

I think the follow-on items are:

  1. Support decimals better (use the smallest variant type, like Int8 or Int64, where possible)
  2. Support round-tripping of Variants --> JSON --> Variants

@alamb (Contributor) commented Jul 3, 2025

@harshmotw-db can you please file a follow-on ticket describing the remaining work needed to support decimals?

@@ -33,10 +33,12 @@ mod decoder;
mod variant;
// TODO: dead code removal
mod builder;
mod from_json;
mod to_json;
#[allow(dead_code)]
Contributor:

This allow was bothering me, so here is a PR to clean it up:

@alamb merged commit 81ab147 into apache:main on Jul 3, 2025
12 checks passed
@alamb (Contributor) commented Jul 3, 2025

Thanks again @harshmotw-db and @scovich

@alamb (Contributor) commented Jul 3, 2025

Here is a follow on PR to break out the json functions:

@harshmotw-db (Contributor Author) commented Jul 3, 2025

@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we make changes to VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

Edit: I have prototyped this separately. I am just looking for a home for this function.

@alamb (Contributor) commented Jul 3, 2025

@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Yes I think that is an important function

I suggest calling it a kernel and putting it in a parquet-variant-compute crate

Perhaps an API like this:

fn to_variant(array: &ArrayRef) -> StructArray {
...
}

Adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we make changes to VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

This is a great idea

@alamb (Contributor) commented Jul 3, 2025

I envision the reverse cast as well

fn variant_to_string<O: OffsetSize>(array: &ArrayRef) -> GenericStringArray<O> {
...
}

fn variant_to_string_view(array: &ArrayRef) -> StringViewArray {
...
}

@alamb (Contributor) commented Jul 3, 2025

(I recommend a new ticket to track this idea BTW)


Successfully merging this pull request may close these issues.

Variant: Read/Parse JSON value as Variant