
Posted to dev@arrow.apache.org by Dmitriy Morozov <in...@gmail.com> on 2016/03/03 09:39:06 UTC

Re: Arrow examples

Hi Wes,

Thanks for raising the ticket. So it seems that Spark 2.0 will not have
support for Arrow.
Also, does SPARK-13534 cover Arrow serialization for Spark's Java API, or do
we need to raise a separate ticket for that?

As of now, I only have a high-level understanding of Arrow and its data
structures, but I'm willing to dive deeper and provide any help I can, mainly
in testing, a Java serializer, or additional examples. Let me know how I can
help.

Thanks,
Dima

On 1 March 2016 at 00:46, Wes McKinney <we...@cloudera.com> wrote:

> hi Dmitriy,
>
> I created the following JIRA
> https://issues.apache.org/jira/browse/SPARK-13534 related to PySpark
> which seems relevant. I would be happy to collaborate with you on
> this. Since I understand that the Spark developers are exploring an
> in-memory columnar layout for Spark DataFrames/Datasets and Spark SQL,
> any conversion code we write right now may end up being temporary.
> Hopefully the Spark columnar memory layout will end up being very
> nearly the same as the official Arrow layout so that limited or no
> conversion will be necessary.
>
> Thanks
> Wes
>
> On Wed, Feb 24, 2016 at 12:38 PM, Dmitriy Morozov <in...@gmail.com>
> wrote:
> > Hello everyone,
> >
> > I'm just starting with Arrow. I'd like to see how good Arrow is at caching
> > when used in conjunction with Alluxio (Tachyon). The use case that I'm
> > going to validate involves reading data from Spark's DataFrame, storing it
> > in Tachyon in Arrow format and then reading it back into a DataFrame. I
> > checked the source code of Arrow but couldn't find any examples or tests.
> > Can anyone please point me to where I should start looking in order to
> > convert a DataFrame to an Arrow struct?
> >
> > Thanks!
> > Dmitriy
>



-- 
Kind regards,
Dima

Re: Arrow examples

Posted by Wes McKinney <we...@cloudera.com>.

Serializing a Spark DataFrame in either Java or Scala would suffice for the
use case, but there may be follow-on JIRAs to make the Arrow adapters more
accessible. pandas only needs access to flat schemas for now, for example,
so nested Spark SQL schemas could be handled in follow-up work.
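
To make the flat-versus-nested split concrete, here is a minimal sketch in
plain Python. It does not use the Spark or Arrow APIs: the dict-based schema
model and the helper names (is_flat, split_fields, rows_to_columns) are
assumptions made up for illustration. It only shows the idea of pivoting the
flat fields into columns while setting nested fields aside for later work.

```python
# Hypothetical illustration only: plain Python, not the Spark or Arrow API.
# A schema is modeled as {field_name: type}, where a primitive type is a
# string such as "int64" and a nested type is a dict (struct) or list (array).

def is_flat(schema):
    """Return True if every field has a primitive (string) type."""
    return all(isinstance(t, str) for t in schema.values())


def split_fields(schema):
    """Separate flat fields from nested ones that need follow-up work."""
    flat = {name: t for name, t in schema.items() if isinstance(t, str)}
    nested = {name: t for name, t in schema.items() if not isinstance(t, str)}
    return flat, nested


def rows_to_columns(rows, flat_schema):
    """Pivot row dicts into one list per flat field (a columnar layout).

    Missing values become None; a real Arrow array would track them in a
    validity bitmap instead.
    """
    return {name: [row.get(name) for row in rows] for name in flat_schema}


# Made-up sample data: two flat fields and one nested struct field.
schema = {"id": "int64", "name": "string", "address": {"city": "string"}}
rows = [
    {"id": 1, "name": "a", "address": {"city": "x"}},
    {"id": 2, "name": None, "address": {"city": "y"}},
]

flat, nested = split_fields(schema)
columns = rows_to_columns(rows, flat)
```

In this sketch only the "id" and "name" columns would be converted, and
"address" would be reported as deferred, which mirrors the proposal of
handling flat schemas first.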

Note: this is somewhat dependent on the separate thread around the metadata
specification -- ideally Spark SQL would be able to adapt its schema
metadata to a form that any Arrow consumer can use.

- Wes
