[VARIANT] Add support for the json_to_variant API #7783
Conversation
cc @scovich
FYI @carpecodeum This is amazing @harshmotw-db -- I was just thinking about this. I started the CI checks and will check this PR more carefully tomorrow
parquet-variant/Cargo.toml
Outdated
-serde_json = "1.0"
+# "arbitrary_precision" allows us to manually parse numbers. "preserve_order" does not automatically
+# sort object keys
+serde_json = { version = "1.0", features = ["arbitrary_precision", "preserve_order"] }
TIL about the (undocumented) arbitrary_precision feature flag and the (undocumented) Number::as_str method it unlocks 🤯

That could solve a lot of problems for variant decimals on the to_json path as well:
https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/to_json.rs#L147-L162
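To illustrate the idea, here is a hedged, std-only sketch (not the arrow-rs code) of how a decimal literal exposed as a string, e.g. via Number::as_str under arbitrary_precision, could be parsed into an unscaled integer plus a scale without a lossy f64 round-trip. The helper name parse_decimal is illustrative only:

```rust
// Hypothetical helper: parse a decimal literal like "12.345" into
// (unscaled value, scale) without going through f64. Not the
// parquet-variant API; names are illustrative.
fn parse_decimal(s: &str) -> Option<(i128, u32)> {
    let (int_part, frac_part) = match s.split_once('.') {
        Some((i, f)) => (i, f),
        None => (s, ""),
    };
    let negative = int_part.starts_with('-');
    // Concatenate integer and fractional digits; the scale is the
    // number of fractional digits.
    let digits: String = int_part
        .trim_start_matches(|c: char| c == '-' || c == '+')
        .chars()
        .chain(frac_part.chars())
        .collect();
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let mut unscaled: i128 = digits.parse().ok()?;
    if negative {
        unscaled = -unscaled;
    }
    Some((unscaled, frac_part.len() as u32))
}

fn main() {
    assert_eq!(parse_decimal("12.345"), Some((12345, 3)));
    assert_eq!(parse_decimal("-0.79"), Some((-79, 2)));
    assert_eq!(parse_decimal("42"), Some((42, 0)));
    assert_eq!(parse_decimal("not a number"), None);
}
```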
Question tho -- do we need to support both modes of serde_json, given that the user (not arrow-rs) probably decides whether to use that feature flag?
It's possible some projects might prioritize performance and might not want the Decimal type at all. I would prefer doing that as a follow up though. If you agree, I'll go ahead and create an issue.
Oh, for sure it's a follow-up. But we probably do need a story for how to handle that feature flag in a dependency (maybe part of the bigger story of how to handle a serde_json dependency in the first place).
parquet-variant/src/variant.rs
Outdated
if n.is_i64() {
    // Find minimum Integer width to fit
    let i = n.as_i64().unwrap();
-if n.is_i64() {
-    // Find minimum Integer width to fit
-    let i = n.as_i64().unwrap();
+if let Some(i) = n.as_i64() {
+    // Find minimum Integer width to fit
parquet-variant/src/variant.rs
Outdated
if i as i8 as i64 == i {
    Ok((i as i8).into())
} else if i as i16 as i64 == i {
    Ok((i as i16).into())
} else if i as i32 as i64 == i {
    Ok((i as i32).into())
} else {
    Ok(i.into())
}
Interesting approach. That double .. as .. as .. definitely caused a double take, but I guess it could be cheaper than TryFrom?
-if i as i8 as i64 == i {
-    Ok((i as i8).into())
-} else if i as i16 as i64 == i {
-    Ok((i as i16).into())
-} else if i as i32 as i64 == i {
-    Ok((i as i32).into())
-} else {
-    Ok(i.into())
-}
+let value = i8::try_from(i)
+    .map(Variant::from)
+    .or_else(|_| i16::try_from(i).map(Variant::from))
+    .or_else(|_| i32::try_from(i).map(Variant::from))
+    .unwrap_or_else(|_| Variant::from(i));
+Ok(value)
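As a sanity check on the discussion above, here is a small std-only sketch showing that the `as`-cast round-trip test and a TryFrom chain select the same minimum width. The function names are illustrative, not part of the PR:

```rust
// Compare the two narrowing strategies: the `as .. as ..` round-trip
// test and a TryFrom chain. Both pick the smallest signed width that
// holds `i` losslessly.
fn width_by_cast(i: i64) -> u8 {
    if i as i8 as i64 == i {
        8
    } else if i as i16 as i64 == i {
        16
    } else if i as i32 as i64 == i {
        32
    } else {
        64
    }
}

fn width_by_try_from(i: i64) -> u8 {
    if i8::try_from(i).is_ok() {
        8
    } else if i16::try_from(i).is_ok() {
        16
    } else if i32::try_from(i).is_ok() {
        32
    } else {
        64
    }
}

fn main() {
    for i in [0, 127, 128, -128, -129, 32_767, 40_000, i64::MAX, i64::MIN] {
        assert_eq!(width_by_cast(i), width_by_try_from(i));
    }
    assert_eq!(width_by_cast(128), 16);
    assert_eq!(width_by_cast(-128), 8);
}
```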
pub struct SampleBoxBasedVariantBufferManager {
    pub value_buffer: Box<[u8]>,
    pub metadata_buffer: Box<[u8]>,
What purpose would this serve? It seems strictly inferior to a vec for both normal and testing uses?
Continuing the above thought about building up an arrow Array... maybe a SampleMultiVariantBufferManager that can stack a bunch of variants end to end, with a len method that allows extracting offsets?
Yes, I do plan on adding an arrow-compliant sample buffer manager. I do not fully understand arrow yet, but I believe the trait is simple enough that it could be extended to such use cases. I will do that as a follow up if you agree with the general design.
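The stacking idea sketched above can be illustrated with a std-only toy (the type name and methods are hypothetical, not the arrow-rs design): several serialized variants are appended end to end in one buffer while an offsets vector records the boundaries, the same layout an arrow BinaryArray uses for its values:

```rust
// Illustrative sketch of stacking variants end to end with offsets.
// Not the arrow-rs API; names are made up for this example.
struct MultiVariantBuffer {
    bytes: Vec<u8>,
    offsets: Vec<usize>, // offsets[i]..offsets[i + 1] is variant i
}

impl MultiVariantBuffer {
    fn new() -> Self {
        Self { bytes: Vec::new(), offsets: vec![0] }
    }

    /// Append one serialized variant; record where it ends.
    fn append(&mut self, variant_bytes: &[u8]) {
        self.bytes.extend_from_slice(variant_bytes);
        self.offsets.push(self.bytes.len());
    }

    /// Borrow variant `i` back out of the stacked buffer.
    fn get(&self, i: usize) -> &[u8] {
        &self.bytes[self.offsets[i]..self.offsets[i + 1]]
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

fn main() {
    let mut buf = MultiVariantBuffer::new();
    buf.append(b"abc");
    buf.append(b"de");
    assert_eq!(buf.len(), 2);
    assert_eq!(buf.get(0), b"abc");
    assert_eq!(buf.get(1), b"de");
    assert_eq!(buf.offsets, vec![0, 3, 5]);
}
```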
@@ -0,0 +1,124 @@
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
I'm not quite sure I understand this trait.
It almost seems like a good building block for eventual columnar writes, if an implementation took a mutable reference to a vec, where each VariantBufferManager
you build with it is only capable of appending new bytes to the end (leaving existing bytes unchanged). That way, one could build up an offset array along the way as part of building up arrow binary arrays for the metadata and value columns?
But I don't understand the whole "may be called several times" part, which seems to return the whole slice on every call? I guess that's because the variant builder doesn't do its work all at once, but still expects to receive the overall slice?
Finally, why split the ensure and borrow methods, when both take &mut self
and it's almost certainly dangerous to "borrow" without an "ensure" first (can't guess the size in advance, it's variant data)? Why not just a pair of ensure_and_borrow_XXX_buffer
methods that return a slice of exactly the requested size (no larger, no smaller)?
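The combined shape suggested above might look something like the following std-only sketch. This is a guess at the proposed signature, not the trait as merged; the error type and names are stand-ins:

```rust
// Hypothetical combined ensure-and-borrow method: one call that
// grows the buffer if needed and hands back exactly `size` bytes.
use std::fmt;

#[derive(Debug)]
struct BufferError(String);

impl fmt::Display for BufferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

trait VariantBufferManager {
    /// Ensure the value buffer holds at least `size` bytes and
    /// return a mutable slice of exactly that length.
    fn ensure_and_borrow_value_buffer(&mut self, size: usize) -> Result<&mut [u8], BufferError>;
}

struct VecBufferManager {
    value_buffer: Vec<u8>,
}

impl VariantBufferManager for VecBufferManager {
    fn ensure_and_borrow_value_buffer(&mut self, size: usize) -> Result<&mut [u8], BufferError> {
        if self.value_buffer.len() < size {
            // A Vec can always grow; a fixed-capacity manager would
            // return Err here instead.
            self.value_buffer.resize(size, 0);
        }
        Ok(&mut self.value_buffer[..size])
    }
}

fn main() {
    let mut mgr = VecBufferManager { value_buffer: Vec::new() };
    let slice = mgr.ensure_and_borrow_value_buffer(16).unwrap();
    assert_eq!(slice.len(), 16);
}
```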
I think it's reasonable to return the whole slice, and have the Variant builder choose to write data wherever it chooses to. The trait was originally created for this PR where bytes are often being shifted to make way for headers etc. Let me know if you disagree or would recommend a different simple alternative.

As for why the methods are separate, I think you are generally correct and they could be combined. In the original PR, I was calling ensure_value_buffer_size only occasionally, but that was because I was performing size checks in the variant library before calling the method, and that check could be moved into the method.
I have consolidated the two methods.
Given that the VariantBuilders currently need to write into a memory buffer anyway, I wonder what this trait is trying to abstract? Like maybe json_to_variant should take a VariantBuilder to write to and not try to manage buffers itself 🤔
Oh, my opinion is that the variant builders should write to buffers owned by the caller. So no more ValueBuffer and MetadataBuffer - just VariantBufferManager.
delta-io/delta-kernel-rs#1034
This PR just lays the groundwork for that.
I think having VariantBuilder write to buffers managed by somewhere else is a good idea and is useful more generally than json_to_variant.

Perhaps we can make the json_to_variant API be in terms of the VariantBuilder, and then modify the VariantBuilder in a separate PR to handle buffers owned by the caller.
I filed a ticket to track the idea of writing to buffers owned by the caller here
Wouldn't it be better if the VariantBuilder dependency was abstracted away? So, right now the functionality looks like: (1) here's the JSON string, (2) here are my buffers (variant_buffer_manager), write the variant to the buffers.

The alternative is: (1) here's the JSON string, (2) here are my buffers, (3) this is the VariantBuilder built using my buffers (which I don't know why you need) => write the variants to these buffers.

I think the current approach (VariantBufferManager) should be reasonably extensible to every use case, and it also supports resizing on the fly. The example solution described in this ticket would likely be retry based, which would have a higher performance overhead.
Actually, yeah, in the short run I think it is good not to have a dependency on VariantBufferManager since there are unanswered questions. I will implement the VariantBuilder-based API today.
@alamb I have removed the dependency of the current PR on custom buffer management and implemented what you suggested.
use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBufferManager};

pub struct SampleVariantBufferManager {
How is this different from the box-based one above, and why not just use the vec-based one above instead?
Good catch! I forgot to replace this
Here is some parallel art from @zeroshade in the go implementation:
Thanks again @harshmotw-db -- I took a quick look -- this is a great start. I'll look more carefully tomorrow
parquet-variant/src/variant.rs
Outdated
impl TryFrom<&Number> for Variant<'_, '_> {
    type Error = ArrowError;

    fn try_from(n: &Number) -> Result<Self, Self::Error> {
follow up sounds like a good plan to me
parquet-variant/src/variant.rs
Outdated
@@ -32,6 +32,8 @@
 mod decimal;
 mod list;
 mod metadata;
 mod object;
+use rust_decimal::prelude::*;
+use serde_json::Number;
I think we should eventually plan to avoid all use of serde_json for parsing / serializing.
It is fine to use serde_json for the initial version / getting things functionally working, but it is terribly inefficient compared to more optimized implementations like the tape decoder
Thus can we please try and keep anything serde_json specific out of Variant (like this From impl)?
Oh, I'll move much of this code from variant.rs to from_json.rs if that's what you meant.
Done
LGTM. A few small cleanups to consider before merge.
let new_len = size.next_power_of_two();
self.value_buffer.resize(new_len, 0);
I'm not sure we need to do this -- the underlying vec is guaranteed to have reasonable amortized allocation costs:
Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. ... Whatever strategy is used will of course guarantee O(1) amortized push.
I did this for demonstration purposes, so that custom non-vec implementations allocate reasonably sized buffers and don't encounter O(n^2) complexity. Maybe we should request power-of-two sizes from the library itself when we integrate this deeper into construction.
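The amortization argument above can be demonstrated with a std-only sketch: rounding each requested size up to the next power of two means that even many small incremental requests trigger only a logarithmic number of actual resizes. The `grow` helper is illustrative, not the sample manager's exact code:

```rust
// Grow a buffer to the next power of two at or above the requested
// size, mirroring the sample manager's policy.
fn grow(buffer: &mut Vec<u8>, requested: usize) {
    if buffer.len() < requested {
        let new_len = requested.next_power_of_two();
        buffer.resize(new_len, 0);
    }
}

fn main() {
    let mut buf = Vec::new();
    let mut resizes = 0;
    let mut last_len = 0;
    // 1000 incremental size requests...
    for size in 1..=1000 {
        grow(&mut buf, size);
        if buf.len() != last_len {
            resizes += 1;
            last_len = buf.len();
        }
    }
    // ...but only ~log2(1000) actual reallocations.
    assert_eq!(buf.len(), 1024);
    assert_eq!(resizes, 11); // lengths 1, 2, 4, ..., 1024
}
```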
parquet-variant/src/from_json.rs
Outdated
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
    for val in arr {
        append_json(val, builder)?;
    }
    Ok(())
}

fn build_object<'a, 'b>(
These two functions only have one caller each now -- should we just fold their 3-4 lines of code into their call sites?
I have changed it to try_fold. Lmk if you meant something else.
Oh sorry, I just meant to get rid of the functions and move their code directly where the function used to be called from.
Done
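For reference, the try_fold pattern mentioned in this exchange looks like the following std-only sketch. Here `append_json` is a stand-in for the real fallible append, and the loop/fold equivalence is the point:

```rust
// Folding a list through a fallible append, stopping at the first
// error -- equivalent to: for val in arr { append_json(val, builder)?; }
fn append_json(val: i32, out: &mut Vec<i32>) -> Result<(), String> {
    if val < 0 {
        return Err(format!("negative value: {val}"));
    }
    out.push(val);
    Ok(())
}

fn main() {
    let mut out = Vec::new();
    let ok = [1, 2, 3]
        .iter()
        .try_fold((), |(), v| append_json(*v, &mut out));
    assert!(ok.is_ok());
    assert_eq!(out, vec![1, 2, 3]);

    // The fold short-circuits on the first Err.
    let err = [4, -5, 6]
        .iter()
        .try_fold((), |(), v| append_json(*v, &mut out));
    assert!(err.is_err());
    assert_eq!(out, vec![1, 2, 3, 4]); // 6 was never appended
}
```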
For sure. Also, we have some prior art in these crates here: those require knowing the precision up front, but we could perhaps adapt the code.
@alamb @scovich Yeah, I think that makes sense. For now, we could do away with the decimal type in json_to_variant and always use double, and soon replace it when the custom parser is ready. That being said - just to clarify - the number parsing was never a hard problem; that just overcomes the rust_decimal dependency.
I think this also is a property of the tape decoder -- since it doesn't parse into a tree / hash map structure, it will present the fields in the order they appear in the JSON text.
@alamb I have now removed all dependency on the custom serde_json and ignored all decimal tests. See if the PR can be merged now.
The tape decoder presents JSON numeric values as strings, but they still need to be parsed. Meanwhile, I had a little too much fun playing with parsing code (playground). It's a lot more efficient than the "try it and see" approach my pathfinding PR took. Probably a little too fancy tho (code size could cause instruction cache and branch prediction problems that actually hurt performance).
LGTM, but please make the lifetimes more readable?
parquet-variant/src/to_json.rs
Outdated
@@ -249,27 +245,6 @@ pub fn variant_to_json_string(variant: &Variant) -> Result<String, ArrowError> {
         .map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {e}")))
 }

-fn variant_decimal_to_json_value(decimal: &impl VariantDecimal) -> Result<Value, ArrowError> {
Seems like a macro could avoid both overhead and duplication? The only part that needs to change (besides the data types) is Decimal16 has a more complicated way of handling the integer value produced at the end.
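The macro idea can be sketched as follows. This is a hedged, std-only illustration of the shape (one generated conversion routine per decimal width, with only the integer type varying), not the parquet-variant internals; the function names and string-based output are made up for the example:

```rust
// Sketch: generate one decimal-rendering function per width via a
// macro, avoiding both trait-object overhead and copy-paste.
macro_rules! decimal_to_string {
    ($name:ident, $int_ty:ty) => {
        fn $name(unscaled: $int_ty, scale: u32) -> String {
            // Assumes scale >= 1; a real version would special-case 0.
            let divisor = (10 as $int_ty).pow(scale);
            let sign = if unscaled < 0 { "-" } else { "" };
            let whole = (unscaled / divisor).abs();
            let frac = (unscaled % divisor).abs();
            format!("{sign}{whole}.{frac:0width$}", width = scale as usize)
        }
    };
}

decimal_to_string!(decimal4_to_string, i32);
decimal_to_string!(decimal8_to_string, i64);
decimal_to_string!(decimal16_to_string, i128);

fn main() {
    assert_eq!(decimal4_to_string(12345, 2), "123.45");
    assert_eq!(decimal8_to_string(-500, 3), "-0.500");
    assert_eq!(decimal16_to_string(1, 1), "0.1");
}
```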
@@ -254,7 +269,7 @@
 fn test_json_to_variant_double_precision() -> Result<(), ArrowError> {
     JsonToVariantTest {
         json: "0.79228162514264337593543950335",
-        expected: Variant::Double(0.792_281_625_142_643_3_f64),
+        expected: Variant::Double(0.792_281_625_142_643_4_f64),
what made this one change in the latest commit, out of curiosity?
I didn't investigate too much, but I suppose the arbitrary_precision flag was causing slightly different results in parsing doubles. I didn't care because doubles lose precision anyway.
parquet-variant/src/from_json.rs
Outdated
struct ObjectFieldBuilder<'a, 'b, 'c> {
    key: &'a str,
    builder: &'b mut ObjectBuilder<'c, 'a>,
This is hard to interpret... can we use 'm and 'v?

-struct ObjectFieldBuilder<'a, 'b, 'c> {
-    key: &'a str,
-    builder: &'b mut ObjectBuilder<'c, 'a>,
+struct ObjectFieldBuilder<'m, 'v, 'r> {
+    key: &'r str,
+    builder: &'r mut ObjectBuilder<'m, 'v>,

(here, 'r is the lifetime of the references used to construct the field builder)
I've changed it to 's, 'o, and 'v, where 's is the lifetime of the string, 'o is the lifetime of the ObjectBuilder, and 'v is the lifetime of the variant buffers.
I merged up from main and fixed a clippy error to get a clean CI run. Given how long this PR has been outstanding, and that it blocks a bunch of other work, I think we should try to merge it in and file follow-ons asap. I am giving it another review now.
Co-authored-by: Ryan Johnson <[email protected]>
}

#[test]
fn test_json_to_variant_object_very_large() -> Result<(), ArrowError> {
I wasn't able to make it faster with a small effort, but I did briefly look at some profiling and it is spending a very large amount of time validating the offsets in the variant
That is likely something we can improve on over time
I agree with @scovich this PR is ready to go -- thank you for the work on this @harshmotw-db

I think the follow on items are:
- Support decimals better (use the smallest variant type like Int8 or Int64 if possible)
- Support round tripping of variants --> JSON --> Variants

@harshmotw-db can you please file a follow on ticket that describes the remaining work to support decimal in your mind?
@@ -33,10 +33,12 @@
 mod decoder;
 mod variant;
 // TODO: dead code removal
 mod builder;
+mod from_json;
 mod to_json;
+#[allow(dead_code)]
This allow was bothering me , so here is a PR to clean it up:
Thanks again @harshmotw-db and @scovich

Here is a follow on PR to break out the json functions:
@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings, some of which could be null) and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Also, adding this high-level functionality would abstract away many of the changes we make to the lower-level library if we make changes to it.

Edit: I have prototyped this separately. I am just looking for a home for this function.
Yes I think that is an important function. Perhaps an API like this:

fn to_variant(array: &ArrayRef) -> StructArray {
    ...
}

This is a great idea
I envision the reverse cast as well:

fn variant_to_string<O: OffsetSize>(array: &ArrayRef) -> GenericStringArray<O> {
    ...
}

fn variant_to_string_view(array: &ArrayRef) -> StringViewArray {
    ...
}
(I recommend a new ticket to track this idea BTW)
Which issue does this PR close?

Rationale for this change

Explained in the issue.

What changes are included in this PR?

This PR includes a json_to_variant API to parse JSON strings as Variants. json_to_variant takes as arguments the input JSON string and an object builder of type VariantBuilder, and builds the variant using builder. The resulting variant can be extracted using builder.finish(), which consumes the builder and returns the Variant buffers.

Are these changes tested?

Unit Tests, and an example file.

Are there any user-facing changes?

Yes, the PR introduces the json_to_variant API.