Posted to dev@arrow.apache.org by Jeetendra Kumar Jaiswal <je...@impetus.co.in.INVALID> on 2019/12/05 09:53:20 UTC

Java - Spark dataframe to Arrow format

Hi Dev Team,

Can someone please let me know how to convert a Spark DataFrame to Arrow
format? I am coding in Java.

The Java documentation of Arrow only has function-level API information. It
is a little hard to develop without proper documentation.

Is there a way to directly convert a Spark DataFrame into Arrow-format
data?

Thanks,
Jeetendra

________________________________






NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Re: Java - Spark dataframe to Arrow format

Posted by Micah Kornfield <em...@gmail.com>.
There was a discussion/proposal a while ago on the Spark mailing list to
use the Arrow memory format natively within Spark [1], but IIUC the
proposal was scaled back to only exposing vectorized APIs.

Looking quickly at the links Wes provided, one option for potential
speed-up could be a dynamically generated ArrowWriter [2].  I'm not sure
how much of a performance benefit this would provide (and it also doesn't
solve memory overhead issues).  I think this type of functionality should
be discussed on the Spark mailing list.

For the Parquet parsing, there are some open pull requests and some
discussion on exposing more of the C++ functionality through JNI bindings,
which, when combined with ArrowColumnVector, might provide a feasible way
to reduce overhead.

Maybe members of the Spark community can comment on how we can better
collaborate to further reduce overhead?

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/SPARK-27396
[2]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

On Fri, Dec 6, 2019 at 5:11 AM GaoXiang Wang <wg...@gmail.com> wrote:


Re: Java - Spark dataframe to Arrow format

Posted by GaoXiang Wang <wg...@gmail.com>.
Hi Wes and Liya,

Appreciate your feedback and information.

Looking forward to a more efficient integration between Arrow and Spark on
the Java/Scala level. I would like to make my contribution if I can help in
any way during my free time.

Thank you very much.


*Best Regards,*
*WANG GAOXIANG (Eric)*
National University of Singapore Graduate ::
API Craft Singapore Co-organiser ::
Singapore Python User Group Co-organiser
*+6597685360 (P) :: wgx731@gmail.com (E) :: https://medium.com/@wgx731 (W)*


On Fri, Dec 6, 2019 at 6:17 PM Fan Liya <li...@gmail.com> wrote:


Re: Java - Spark dataframe to Arrow format

Posted by Fan Liya <li...@gmail.com>.
Hi folks,

Thanks for your clarification.

I also think this is a universal requirement (including Java UDF in Arrow
format).

The Java converter provided by Spark is inefficient, for two reasons (IMO):

1. There are frequent memory copies between on-heap and off-heap memory.
2. The Spark API exposes a row-oriented view (an Iterator of InternalRow),
so we need to perform row/column conversion, and we cannot copy data in
batch.

To solve the problem, maybe we need something equivalent to pandas in Java
(I think pandas acts as a bridge between PyArrow and PySpark).
In addition, we need to integrate it with both Arrow and Spark.

Best,
Liya Fan

On Fri, Dec 6, 2019 at 2:14 AM Chen Li <cn...@fb.com> wrote:


Re: Java - Spark dataframe to Arrow format

Posted by Chen Li <cn...@fb.com>.
We have a similar use case, and we use ArrowConverters.scala mentioned by Wes. However, the overhead of the conversion is kinda high.
________________________________
From: Wes McKinney <we...@gmail.com>
Sent: Thursday, December 5, 2019 6:53 AM
To: dev <de...@arrow.apache.org>
Cc: Fan Liya <li...@gmail.com>; jeetendra.jaiswal@impetus.co.in.invalid <je...@impetus.co.in.invalid>
Subject: Re: Java - Spark dataframe to Arrow format


Re: Java - Spark dataframe to Arrow format

Posted by Wes McKinney <we...@gmail.com>.
hi folks,

I understand the question to be about serialization.

see

* https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
* https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
* https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala

This code is used to convert between Spark DataFrames and the Arrow
columnar format for UDF evaluation purposes.

On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang <wg...@gmail.com> wrote:

Re: Java - Spark dataframe to Arrow format

Posted by GaoXiang Wang <wg...@gmail.com>.
Hi Jeetendra and Liya,

I actually have a similar use case. We have some data stored in *parquet
format in HDFS* and would like to make use of Apache Arrow to improve
compute performance if possible. Right now, I don't see a direct way to do
this in Java with Spark.

I have searched the Spark documentation; it looks like Python support was
added in 2.3.0 (
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
Is there any plan from the Apache Arrow team to provide *Spark integration
for Java*?

Thank you very much.


*Best Regards,*
*WANG GAOXIANG (Eric)*
National University of Singapore Graduate ::
API Craft Singapore Co-organiser ::
Singapore Python User Group Co-organiser
*+6597685360 (P) :: wgx731@gmail.com (E) :: https://medium.com/@wgx731 (W)*


On Thu, Dec 5, 2019 at 6:58 PM Fan Liya <li...@gmail.com> wrote:


Re: Java - Spark dataframe to Arrow format

Posted by Fan Liya <li...@gmail.com>.
Hi Jeetendra,

I am not sure if I understand your question correctly.

Arrow is an in-memory columnar data format, and Spark has its own in-memory
data format for DataFrame, which is invisible to end users.
So the Spark user has no control over the underlying in-memory layout.

If you really want to convert a DataFrame into Arrow format, maybe you can
save the results of a Spark job to some external store (e.g. in ORC
format), and then load it back into memory in Arrow format (if this is what
you want).

Best,
Liya Fan

On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal
<je...@impetus.co.in.invalid> wrote:
