[VARIANT] Add support for the json_to_variant API #7783


Merged
merged 43 commits into apache:main on Jul 3, 2025

Conversation

@harshmotw-db (Contributor) commented Jun 25, 2025

Which issue does this PR close?

Rationale for this change

Explained in the issue.

What changes are included in this PR?

This PR includes a json_to_variant API to parse JSON strings as Variants. json_to_variant takes the input JSON string and a VariantBuilder as arguments and builds the variant using the builder. The resulting variant can be extracted with builder.finish(), which consumes the builder and returns the Variant buffers.
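As a rough illustration of that flow, here is a minimal sketch assuming the signatures described above (the exact API surface may differ):

```rust
use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBuilder};

// Parse a JSON string and return the (metadata, value) Variant buffers.
fn parse(json: &str) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
    let mut builder = VariantBuilder::new();
    json_to_variant(json, &mut builder)?;
    // finish() consumes the builder and hands back the buffers.
    Ok(builder.finish())
}
```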

Are these changes tested?

Unit tests and an example file.

Are there any user-facing changes?

Yes, the PR introduces the json_to_variant API.

@github-actions bot added the `parquet` (Changes to the parquet crate) label Jun 25, 2025
@harshmotw-db (Contributor Author):

cc @scovich

@alamb (Contributor) commented Jun 25, 2025

FYI @carpecodeum

This is amazing @harshmotw-db -- I was just thinking about this.

I started the CI checks and will check this PR more carefully tomorrow

@harshmotw-db changed the title from "add json_to_variant" to "[VARIANT] Add support for the json_to_variant API" on Jun 25, 2025
serde_json = "1.0"
# "arbitrary_precision" allows us to manually parse numbers. "preserve_order" does not automatically
# sort object keys
serde_json = { version = "1.0", features = ["arbitrary_precision", "preserve_order"] }
Contributor:

TIL about the (undocumented) arbitrary_precision feature flag and (undocumented) Number::as_str method it unlocks 🤯

That could solve a lot of problems for variant decimals on the to_json path as well:
https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/to_json.rs#L147-L162
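As a small, hedged illustration of that feature (assuming serde_json is built with arbitrary_precision):

```rust
fn raw_number_text() {
    // With "arbitrary_precision", serde_json keeps the number's original
    // literal text, so Number::as_str avoids a lossy f64 round-trip.
    let v: serde_json::Value =
        serde_json::from_str("123456789.000000000000000000001").unwrap();
    if let serde_json::Value::Number(n) = v {
        assert_eq!(n.as_str(), "123456789.000000000000000000001");
    }
}
```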

Contributor:

Question tho -- do we need to support both modes of serde_json, given that the user (not arrow-rs) probably decides whether to use that feature flag?

Contributor Author:

It's possible some projects might prioritize performance and might not want the Decimal type at all. I would prefer doing that as a follow up though. If you agree, I'll go ahead and create an issue.

Contributor:

Oh, for sure it's a follow-up. But we probably do need a story for how to handle that feature flag in a dependency (maybe part of the bigger story of how to handle a serde_json dependency in the first place).


Comment on lines 1040 to 1042
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
Contributor:

Suggested change
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
if let Some(i) = n.as_i64() {
// Find minimum Integer width to fit

Comment on lines 1043 to 1051
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
Contributor:

Interesting approach. That double .. as .. as .. definitely caused a double take, but I guess it could be cheaper than TryFrom?

Suggested change
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
let value = i8::try_from(i)
.map(Variant::from)
.or_else(|_| i16::try_from(i).map(Variant::from))
.or_else(|_| i32::try_from(i).map(Variant::from))
    .unwrap_or_else(|_| Variant::from(i));
Ok(value)

Comment on lines 36 to 38
pub struct SampleBoxBasedVariantBufferManager {
pub value_buffer: Box<[u8]>,
pub metadata_buffer: Box<[u8]>,
Contributor:

What purpose would this serve? It seems strictly inferior to a vec for both normal and testing uses?

Contributor:

Continuing the above thought about building up an arrow Array... maybe a SampleMultiVariantBufferManager that can stack a bunch of variants end to end, with a len method that allows to extract offsets?
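A hedged sketch of that idea (all names here are hypothetical, not part of this PR):

```rust
// Hypothetical: stack finished variants end to end and record offsets, so the
// buffers could later back arrow binary arrays for metadata/value columns.
pub struct SampleMultiVariantBufferManager {
    value_buffer: Vec<u8>,
    metadata_buffer: Vec<u8>,
    // End offset of each appended variant; usable as an arrow offsets array.
    value_offsets: Vec<usize>,
    metadata_offsets: Vec<usize>,
}

impl SampleMultiVariantBufferManager {
    /// Number of variants appended so far.
    pub fn len(&self) -> usize {
        self.value_offsets.len()
    }

    /// Record the end of the variant currently at the tail of the buffers.
    pub fn finish_one(&mut self) {
        self.value_offsets.push(self.value_buffer.len());
        self.metadata_offsets.push(self.metadata_buffer.len());
    }
}
```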

@harshmotw-db (Contributor Author) commented Jun 26, 2025:

Yes, I do plan on adding an arrow-compliant sample buffer manager. I don't fully understand arrow yet, but I believe the trait is simple enough to extend to such use cases. I will do that as a follow-up if you agree with the general design.

@@ -0,0 +1,124 @@
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
Contributor:

I'm not quite sure I understand this trait.

It almost seems like a good building block for eventual columnar writes, if an implementation took a mutable reference to a vec, where each VariantBufferManager you build with it is only capable of appending new bytes to the end (leaving existing bytes unchanged). That way, one could build up an offset array along the way as part of building up arrow binary arrays for the metadata and value columns?

But I don't understand the whole "may be called several times" part, which seems to return the whole slice on every call? I guess that's because the variant builder doesn't do its work all at once, but still expects to receive the overall slice?

Finally, why split the ensure and borrow methods, when both take &mut self and it's almost certainly dangerous to "borrow" without an "ensure" first (can't guess the size in advance, it's variant data)? Why not just a pair of ensure_and_borrow_XXX_buffer methods that return a slice of exactly the requested size (no larger, no smaller)?

Contributor Author:

I think it's reasonable to return the whole slice, and have the Variant builder choose to write data wherever it chooses to. The trait was originally created for this PR where bytes are often being shifted to make way for headers etc. Let me know if you disagree or would recommend a different simple alternative.

As for why the methods are separate, I think you are generally correct and they could be combined. In the original PR, I was calling ensure_value_buffer_size only occasionally, because I was performing size checks in the variant library before calling the method; that check could be moved into the method.

Contributor Author:

I have consolidated the two methods.
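For readers following along, the consolidated shape is roughly this (a sketch; the exact trait in the PR may differ):

```rust
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
    /// Ensure the value buffer holds at least `size` bytes, then return it.
    fn ensure_and_borrow_value_buffer(&mut self, size: usize)
        -> Result<&mut [u8], ArrowError>;

    /// Same contract for the metadata buffer.
    fn ensure_and_borrow_metadata_buffer(&mut self, size: usize)
        -> Result<&mut [u8], ArrowError>;
}
```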

@alamb (Contributor) commented Jun 26, 2025:

Given that the VariantBuilders currently need to write into a memory buffer anyway, I wonder what this trait is trying to abstract.

Like maybe the json_to_variant should take a VariantBuilder to write to and not try to manage buffers itself 🤔

@harshmotw-db (Contributor Author) commented Jun 26, 2025:

Oh, my opinion is that the variant builders should write to buffers owned by the caller. So no more ValueBuffer and MetadataBuffer - just VariantBufferManager.
delta-io/delta-kernel-rs#1034
This PR just lays the groundwork for that.

Contributor:

I think having VariantBuilder write to buffers managed elsewhere is a good idea, and it is useful more generally than json_to_variant.

Perhaps we can make the json_to_variant API be in terms of the VariantBuilder and then modify the VariantBuilder in a separate PR to handle buffers owned by the caller.

Contributor:

I filed a ticket to track the idea of writing to buffers owned by the caller here

@harshmotw-db (Contributor Author) commented Jun 27, 2025:

Wouldn't it be better if the VariantBuilder dependency was abstracted away? So, right now the functionality looks like: (1) here's the JSON string, (2) Here are my buffers (variant_buffer_manager), write the variant to the buffers.

The alternative is: (1) Here's the JSON string, (2) Here are my buffers, (3) this is the VariantBuilder built using my buffers (which I don't know why you need) => write the variants to these buffers.

I think the current VariantBufferManager approach should be reasonably extensible to every use case, and it also supports resizing on the fly. The example solution described in this ticket would likely be retry-based, which would have a higher performance overhead.

Contributor Author:

Actually, yeah, in the short run I think it is good not to depend on VariantBufferManager, since there are unanswered questions. I will implement the VariantBuilder-based API today.

@harshmotw-db (Contributor Author) commented Jun 27, 2025:

@alamb I have removed the dependency of the current PR on custom buffer management and implemented what you suggested.

use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBufferManager};

pub struct SampleVariantBufferManager {
Contributor:

How is this different from the box-based one above, and why not just use the vec-based one above instead?

Contributor Author:

Good catch! I forgot to replace this

@alamb (Contributor) commented Jun 26, 2025

Here is some parallel art from @zeroshade in the go implementation:

@alamb (Contributor) left a comment:

Thanks again @harshmotw-db -- I took a quick look -- this is a great start. I'll look more carefully tomorrow

impl TryFrom<&Number> for Variant<'_, '_> {
type Error = ArrowError;

fn try_from(n: &Number) -> Result<Self, Self::Error> {
Contributor:

follow up sounds like a good plan to me

@@ -32,6 +32,8 @@ mod decimal;
mod list;
mod metadata;
mod object;
use rust_decimal::prelude::*;
use serde_json::Number;
Contributor:

I think we should eventually plan to avoid all use of serde_json for parsing / serializing.

It is fine to use serde_json for the initial version / getting things functionally working, but it is terribly inefficient compared to more optimized implementations like the tape decoder.

Thus, can we please try to keep anything serde_json-specific out of Variant (like this From impl)?

Contributor Author:

Oh, I'll move much of this code from variant.rs to from_json.rs if that's what you meant.

Contributor Author:

Done

@harshmotw-db requested review from alamb and scovich, June 26, 2025 08:10
@scovich (Contributor) left a comment:

LGTM. A few small cleanups to consider before merge.

Comment on lines 52 to 53
let new_len = size.next_power_of_two();
self.value_buffer.resize(new_len, 0);
Contributor:

I'm not sure we need to do this -- the underlying vec is guaranteed to have reasonable amortized allocation costs:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. ... Whatever strategy is used will of course guarantee O(1) amortized push.

Contributor Author:

I did this for demonstration purposes, so that custom non-vec implementations allocate reasonably sized buffers and don't encounter O(n^2) complexity. Maybe we should request power-of-two sizes from the library itself when we integrate this deeper into construction.
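The strategy under discussion, as a sketch for a custom implementation (illustrative only):

```rust
// Round requested sizes up to the next power of two so that a sequence of
// small growth requests costs amortized O(1) per byte rather than O(n^2)
// total from repeated exact-size reallocations.
fn ensure_capacity(buf: &mut Vec<u8>, size: usize) {
    if size > buf.len() {
        buf.resize(size.next_power_of_two(), 0);
    }
}
```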

Comment on lines 108 to 115
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
for val in arr {
append_json(val, builder)?;
}
Ok(())
}

fn build_object<'a, 'b>(
Contributor:

These two functions only have one caller each now -- should we just fold their 3-4 lines of code into their call sites?

Contributor Author:

I have changed it to try_fold. Lmk if you meant something else.
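i.e. roughly this shape, reusing the types from the snippet above (illustrative):

```rust
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
    arr.iter().try_fold((), |_, val| append_json(val, builder))
}
```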

Contributor:

Oh sorry, I just meant to get rid of the functions and move their code directly where the function used to be called from.

Contributor Author:

Done

@alamb (Contributor) commented Jul 1, 2025

Seems reasonable. And as a quick-follow, it's not that hard to manually parse a JSON numeric string literal to Variant (pathfinding already done). The grammar is super simple, basically:

For sure

Also, we have some prior art in these crates here:

https://github.com/apache/arrow-rs/blob/5505113d9745aba2cb46df2fd11a4b3d9672d5d2/arrow-cast/src/parse.rs#L854-L853

Those require knowing the precision up front, but we could perhaps adapt the code.

@harshmotw-db (Contributor Author):

@alamb @scovich Yeah, I think that makes sense. For now, we could do away with the decimal type in json_to_variant and always use double, and soon replace it when the custom parser is ready.

That being said, the preserve_order flag ensures that object keys are inserted into the Variant in the order that they appear. Removing it would result in a behavior difference between Arrow and Spark's parse_json expression. I don't think that makes a logical difference but the binaries would look different as the keys would be sorted in the Arrow library.

Just to clarify, the number parsing was never a hard problem; it only overcomes the rust_decimal dependency. What needs to be done away with is that serde_json doesn't expose the raw number strings without the arbitrary_precision flag. But I guess the tape decoder PR implements all of JSON parsing internally, so that's great.

@alamb (Contributor) commented Jul 1, 2025

That being said, the preserve_order flag ensures that object keys are inserted into the Variant in the order that they appear. Removing it would result in a behavior difference between Arrow and Spark's parse_json expression. I don't think that makes a logical difference but the binaries would look different as the keys would be sorted in the Arrow library.

I think this is also a property of the tape decoder -- since it doesn't parse into a tree / hash map structure, it will present the fields in the order they appear in the JSON text.

@harshmotw-db requested a review from alamb, July 2, 2025 21:30
@harshmotw-db (Contributor Author) commented Jul 2, 2025

@alamb I have now removed all dependencies on the custom serde_json features and ignored all decimal tests. See if the PR can be merged now.

@scovich (Contributor) commented Jul 2, 2025

Just to clarify, the number parsing was never a hard problem... But I guess the tape decoder PR implements all of JSON parsing internally so that's great.

The tape decoder presents JSON numeric values as strings. But they still need to be parsed. Meanwhile, I had a little too much fun playing with parsing code (playground).

It's a lot more efficient than the "try it and see" approach my pathfinding PR took. Probably a little too fancy tho (code size could cause instruction cache and branch prediction problems that actually hurt performance).
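For a flavor of how simple that grammar is, a minimal hedged sketch (no exponent handling, illustrative only):

```rust
// Parse a JSON decimal literal into (unscaled value, scale), e.g.
// parse_decimal("-12.345") == Some((-12345, 3)).
fn parse_decimal(s: &str) -> Option<(i128, u32)> {
    let (neg, digits) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_part, frac_part) = digits.split_once('.').unwrap_or((digits, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    let mut unscaled: i128 = 0;
    for b in int_part.bytes().chain(frac_part.bytes()) {
        if !b.is_ascii_digit() {
            return None;
        }
        unscaled = unscaled.checked_mul(10)?.checked_add(i128::from(b - b'0'))?;
    }
    Some((if neg { -unscaled } else { unscaled }, frac_part.len() as u32))
}
```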

@scovich (Contributor) left a comment:

LGTM, but please make the lifetimes more readable?

@@ -249,27 +245,6 @@ pub fn variant_to_json_string(variant: &Variant) -> Result<String, ArrowError> {
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {e}")))
}

fn variant_decimal_to_json_value(decimal: &impl VariantDecimal) -> Result<Value, ArrowError> {
Contributor:

Seems like a macro could avoid both overhead and duplication? The only part that needs to change (besides the data types) is Decimal16 has a more complicated way of handling the integer value produced at the end.

@@ -254,7 +269,7 @@ fn test_json_to_variant_decimal16_max_scale() -> Result<(), ArrowError> {
fn test_json_to_variant_double_precision() -> Result<(), ArrowError> {
JsonToVariantTest {
json: "0.79228162514264337593543950335",
expected: Variant::Double(0.792_281_625_142_643_3_f64),
expected: Variant::Double(0.792_281_625_142_643_4_f64),
Contributor:

what made this one change in the latest commit, out of curiosity?

Contributor Author:

I didn't investigate too much, but I suppose the arbitrary_precision flag was causing slightly different results in parsing doubles. I didn't care because doubles lose precision anyway.

Comment on lines 132 to 134
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
Contributor:

This is hard to interpret... can we use 'm and 'v?

Suggested change
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
struct ObjectFieldBuilder<'m, 'v, 'r> {
key: &'r str,
builder: &'r mut ObjectBuilder<'m, 'v>,

(here, 'r is the lifetime of the references used to construct the field builder)

Contributor Author:

I've changed it to 's, 'o, and 'v, where 's is the lifetime of the string, 'o is the lifetime of the ObjectBuilder, and 'v is the lifetime of the variant buffers.
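i.e. roughly this (a sketch reconstructed from the description above):

```rust
// 's: the input string, 'o: the mutable borrow of the ObjectBuilder,
// 'v: the variant buffers.
struct ObjectFieldBuilder<'s, 'o, 'v> {
    key: &'s str,
    builder: &'o mut ObjectBuilder<'v, 's>,
}
```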

@alamb (Contributor) commented Jul 3, 2025

I merged up from main and fixed a clippy error to get a clean CI run

Given how long this PR has been outstanding and that it blocks a bunch of other work, I think we should try to merge it in and file follow-ons ASAP. I am giving it another review now.

}

#[test]
fn test_json_to_variant_object_very_large() -> Result<(), ArrowError> {
Contributor:

this single test currently takes over 22 seconds to complete


I will see if I can make it faster

Contributor:

I wasn't able to make it faster with a small effort, but I did briefly look at some profiling and it is spending a very large amount of time validating the offsets in the variant

That is likely something we can improve on over time

@alamb (Contributor) left a comment:

I agree with @scovich this PR is ready to go -- thank you for the work in this @harshmotw-db

I think the follow-on items are:

  1. Support decimals better (use the smallest variant type, like Int8 or Int64, where possible)
  2. Support round-tripping of Variants --> JSON --> Variants

@alamb (Contributor) commented Jul 3, 2025

@harshmotw-db can you please file a follow-on ticket describing the remaining work needed to support decimals?

@@ -33,10 +33,12 @@ mod decoder;
mod variant;
// TODO: dead code removal
mod builder;
mod from_json;
mod to_json;
#[allow(dead_code)]
Contributor:

This allow was bothering me, so here is a PR to clean it up:

@alamb merged commit 81ab147 into apache:main on Jul 3, 2025
12 checks passed
@alamb (Contributor) commented Jul 3, 2025

Thanks again @harshmotw-db and @scovich

@alamb (Contributor) commented Jul 3, 2025

Here is a follow on PR to break out the json functions:

@harshmotw-db (Contributor Author) commented Jul 3, 2025

@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we make changes to VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

Edit: I have prototyped this separately. I am just looking for a home for this function.

@alamb (Contributor) commented Jul 3, 2025

@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Yes I think that is an important function

I suggest calling it a kernel and putting it in a parquet-variant-compute crate

Perhaps an API like this:

fn to_variant(array: &ArrayRef) -> StructArray {
...
}

Adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we make changes to VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

This is a great idea

@alamb (Contributor) commented Jul 3, 2025

I envision the reverse cast as well

fn variant_to_string<O: OffsetSize>(array: &ArrayRef) -> GenericStringArray<O> {
...
}

fn variant_to_string_view(array: &ArrayRef) -> StringViewArray {
...
}

@alamb (Contributor) commented Jul 3, 2025

(I recommend a new ticket to track this idea BTW)


Successfully merging this pull request may close these issues.

Variant: Read/Parse JSON value as Variant