Activity
hwchen commented on Mar 25, 2019
My own use case is definitely on the ETL side. So things that are important to me are:

- `map`/`apply`

I'm sure there are some I'm missing, but these come to mind first.
LukeMathWalker commented on Mar 25, 2019
My main use case concerns ML workloads:

- `ndarray`

galuhsahid commented on Mar 30, 2019
Hi, hope y'all don't mind me chiming in. I'm very interested in dataframe libraries for Rust, and agreed, I think Rust could make a great ETL tool!
I think my main use cases have been covered by the previous comments. Some other ones:
jblondin commented on Apr 19, 2019
I'm looking at ML / data science use cases as well. Basically, I want a library that can ETL some data and feed it into `ndarray` or various machine learning libraries, so interoperability is a big part of what I'm looking for.

Some other features beyond what's already been mentioned:
LukeMathWalker commented on Apr 20, 2019
I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library, @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type.
What do you think?
jblondin commented on Apr 20, 2019
Good point. While I don't think we should be beholden to mimicking the Python way of doing things, in the interest of the 'prefer small crates' Rust philosophy, most of my points should be in a separate ML-focused preprocessing crate. Similarly, we might want to put the time-series-specific features @galuhsahid mentions in a separate crate as well.
Thank you for bringing that up! This thread may be useful for defining some crate boundaries as well as needed use cases.
jblondin commented on Apr 20, 2019
@LukeMathWalker I'm not sure I understand exactly what this means. What would this entail?
LukeMathWalker commented on Apr 21, 2019
I have definitely been too concise there @jblondin, my fault. I meant:

> compile-time checks on common manipulations (e.g. access to columns by index name), steering as far away as possible from a "stringy" API.

quoting myself from #1.
jesskfullwood commented on May 1, 2019
This is something I really struggled with. It would be lovely to do `df["age"].mean()` and have it not compile if "age" is not a valid column label, but there is no way to do this in Rust. The closest I got was to define a trait, then use a macro which expands to something like `df[Age].mean()`. Which works, but is pretty ugly and unintuitive.

frameless has a "symbol" syntax which allows a much cleaner interface (quoting the docs). Time for an RFC? 😄
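The marker-type idea described above can be sketched roughly as follows. All names here are hypothetical (a real library would generate the `Age`/`Name` markers with a macro, and a frame would hold many columns, not one):

```rust
use std::marker::PhantomData;

// Zero-sized marker types standing in for column labels.
struct Age;
struct Name;

// A trait tying a marker type to the column's element type.
trait ColId {
    type Output;
}
impl ColId for Age {
    type Output = u32;
}
impl ColId for Name {
    type Output = String;
}

// A typed column; a full frame would be a heterogeneous list of these.
struct Col<C: ColId> {
    data: Vec<C::Output>,
    _marker: PhantomData<C>,
}

impl<C: ColId> Col<C> {
    fn new(data: Vec<C::Output>) -> Self {
        Col { data, _marker: PhantomData }
    }
}

// mean() only exists for the numeric column type, so misuse is a compile error.
impl Col<Age> {
    fn mean(&self) -> f64 {
        self.data.iter().map(|&x| x as f64).sum::<f64>() / self.data.len() as f64
    }
}

fn main() {
    let ages = Col::<Age>::new(vec![20, 30, 40]);
    println!("{}", ages.mean()); // prints 30
    // Col::<Name>::new(vec!["a".into()]).mean() would not compile:
    // no mean() is defined for a String column.
}
```

The check happens entirely at the type level: misspelling a column marker, or averaging a non-numeric column, fails at compile time rather than at runtime.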
jblondin commented on May 1, 2019
That's basically what the `tablespace` macro in agnes does. I'd agree that a cleaner, simpler approach would be preferred, but I'm not exactly sure how to go about doing that 😄
nevi-me commentedon May 1, 2019
Hi @jesskfullwood another solution that could work is if you lazily evaluate your table/dataframe; though you might not get as ergonomic as
df["age"].mean()
.If you had a
Column
which has a type, you could:jesskfullwood commentedon May 2, 2019
@nevi-me This is the way I originally did it and is basically the approach that Arrow takes. But it is quite limiting and largely negates the point of using Rust, IMO. The `DataFrame` doesn't 'know' what is contained within, so it cannot statically check whether a given operation (e.g. "fetch this column") is valid. This is the major problem I have with pandas et al.

You are also limited in the types a given `Column` can contain, because each type must be enumerated within the `DataType` enum ahead of time. Essentially this limits you to just primitive types. I think it would be much nicer to be able to have e.g. enums like

```rust
enum Sex { Male, Female, NotStated }
```

within a `Column`, rather than falling back to

```rust
is_male: bool
```

Re lazy evaluation, I think that is a separate topic. If you had a hypothetical typesafe dataframe, one could imagine building up operations into a type, a la how `Future`s work, e.g.

```rust
join(df1, df2, UserId1, UserId2) // frame1, frame2, join col 1, join col 2
```

could either directly evaluate the join, resulting in a new `Frame<...>`, or build up a `Join<...>` type which could be executed at a later point. The latter is how Frameless works.

One benefit of lazy evaluation is that, in theory, the query can be optimized much like a database query, so that you only execute the parts strictly necessary to generate the result you ask for.

ETA: I should mention, the optimization layer has conveniently already been written for us: Weld.
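The deferred-operation idea can be sketched as a value that merely records the plan, doing no work until an explicit `execute()` call. This is a toy single-key join with hypothetical names, not any library's actual API:

```rust
// A toy frame of (user_id, name) rows.
struct Frame {
    rows: Vec<(u32, String)>,
}

// Constructing a Join only records the plan, a la Futures: no work happens here.
struct Join<'a> {
    left: &'a Frame,
    right: &'a Frame,
}

fn join<'a>(left: &'a Frame, right: &'a Frame) -> Join<'a> {
    Join { left, right }
}

impl<'a> Join<'a> {
    // execute() actually evaluates the plan (a naive nested-loop join on user_id).
    // A query optimizer could inspect and rewrite the plan before this point.
    fn execute(&self) -> Vec<(u32, String, String)> {
        let mut out = Vec::new();
        for (lid, lname) in &self.left.rows {
            for (rid, rname) in &self.right.rows {
                if lid == rid {
                    out.push((*lid, lname.clone(), rname.clone()));
                }
            }
        }
        out
    }
}

fn main() {
    let df1 = Frame { rows: vec![(1, "alice".into()), (2, "bob".into())] };
    let df2 = Frame { rows: vec![(2, "smith".into())] };
    let plan = join(&df1, &df2); // nothing evaluated yet
    let result = plan.execute();
    println!("{result:?}"); // prints [(2, "bob", "smith")]
}
```

Because the plan is a plain value, it can be composed with further operations, inspected, or optimized before anything runs, which is the database-style optimization benefit mentioned above.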
jblondin commented on May 2, 2019
@jesskfullwood I do think it's possible to create a typesafe dataframe wrapper around Arrow (I'm currently working on it).
Adding custom data types (e.g. enums) might be a bit more difficult -- I'm not currently sure how to handle types outside of Arrow's (or at least the Rust Arrow implementation's) primitive data types. I think it should be theoretically possible, though, with Arrow's union, list, and struct frameworks.
As a more general use case question, what are our needs, datatype-wise, beyond the typical primitives / strings? @jesskfullwood brings up an interesting use case with enums (or really any arbitrary type), but we'd have to figure out how that would work with our interoperability goals.
LukeMathWalker commented on May 2, 2019
I strongly agree with @jesskfullwood - having a list/enum of acceptable/primitive types feels like an anti-pattern to me.
We should be able to handle arbitrary Rust types. The question becomes: how can we make this play nicely with Apache Arrow?
A possible solution would be to use a trait, where a Rust struct/enum provides methods that convert it to a memory layout that uses Apache Arrow primitives. It basically tells us how to lay it down in memory using the capabilities offered by Apache Arrow.
This might be a little tiresome to do at first, but we could probably get to the point where we can automate it for most types using a `#[derive(ArrowCompatible)]` macro.
LukeMathWalker commented on May 3, 2019
Btw, I didn't know about frameless. Super cool! Thanks @jesskfullwood 😄
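The trait idea proposed a few comments up, where a Rust type describes its own lowering to Arrow primitives, might look something like the sketch below. The `ArrowCompatible` trait and the enum-to-`u8` dictionary-index encoding are illustrative assumptions (roughly what a `#[derive(ArrowCompatible)]` macro could generate for a fieldless enum), not an existing API:

```rust
// Hypothetical trait: a type explains how to convert itself to and from an
// Arrow-compatible primitive representation.
trait ArrowCompatible {
    type Repr; // an Arrow primitive, e.g. u8 for a small fieldless enum
    fn to_repr(&self) -> Self::Repr;
    fn from_repr(repr: Self::Repr) -> Self;
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Sex {
    Male,
    Female,
    NotStated,
}

// Store the enum as a dictionary index in a u8 buffer.
impl ArrowCompatible for Sex {
    type Repr = u8;
    fn to_repr(&self) -> u8 {
        match self {
            Sex::Male => 0,
            Sex::Female => 1,
            Sex::NotStated => 2,
        }
    }
    fn from_repr(repr: u8) -> Self {
        match repr {
            0 => Sex::Male,
            1 => Sex::Female,
            _ => Sex::NotStated,
        }
    }
}

fn main() {
    // A "column" of Sex values lowered to its Arrow-friendly buffer and back.
    let col = vec![Sex::Female, Sex::Male, Sex::NotStated];
    let buffer: Vec<u8> = col.iter().map(|s| s.to_repr()).collect();
    let roundtrip: Vec<Sex> = buffer.iter().map(|&b| Sex::from_repr(b)).collect();
    println!("{buffer:?}"); // prints [1, 0, 2]
    assert_eq!(col, roundtrip);
}
```

A derive macro would remove the per-type boilerplate, leaving users free to put arbitrary (encodable) Rust types in columns while the storage stays within Arrow's primitive layouts.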