Posted to dev@spark.apache.org by Priyanka Gomatam <Pr...@microsoft.com.INVALID> on 2019/04/22 15:22:19 UTC

Is there a way to read a Parquet File as ColumnarBatch?

Hi,
I am new to Spark and have been playing around with the Parquet reader code. I have two questions:

  1.  I saw that the code starts at the DataSourceScanExec class, moves on to the ParquetFileFormat class, and uses a VectorizedParquetRecordReader. I tried doing a spark.read.parquet(...) and debugged through the code, but for some reason it never hit the breakpoints I placed in these classes. Perhaps I am doing something wrong, but is there a certain versioning for Parquet readers that I am missing? How do I make the code take the DataSourceScanExec -> ... -> ParquetReader ... -> VectorizedParquetRecordReader ... route?
  2.  If I do manage to make it take the above path, I see there is a point at which the data is filled into ColumnarBatch objects. Has anyone tried returning all the data as ColumnarBatch? Is there any reading material you can point me to?
Thanks in advance, this will be super helpful for me!

Re: Is there a way to read a Parquet File as ColumnarBatch?

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Priyanka,

I've been exploring this part of Spark SQL, so I may be able to help a little.

> but for some reason it never hit the breakpoints I placed in these classes.

Was this for local[*]? I started spark-shell with the JDWP agent enabled and
attached IDEA to debug the code:

SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" ./bin/spark-shell

I used Spark 2.4.1 (with Scala 2.12) and it worked fine for the following
queries:

spark.range(5).write.save("hello")
spark.read.parquet("hello").show
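
If the breakpoints still don't fire, two guesses (assumptions on my side, as I
can't see your setup): spark.read.parquet only plans the scan, so nothing
touches the reader classes until an action runs, and DataSourceScanExec is a
trait, so the concrete physical operator to break in is FileSourceScanExec.
I'd place breakpoints at, say:

org.apache.spark.sql.execution.FileSourceScanExec#inputRDD
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat#buildReaderWithPartitionValues
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader#nextBatch

and then force execution with an action:

// show() only reads a handful of rows; collect() scans everything
spark.read.parquet("hello").collect()

Also note the vectorized path is only taken when
spark.sql.parquet.enableVectorizedReader is true (the default) and the schema
consists of atomic types only; nested structs, arrays, and maps fall back to
the row-based parquet-mr reader.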

> has anyone tried returning all the data as ColumnarBatch? Is there any reading material you can point me to?

You may find some information in the internals book at
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnarBatch.html
It's a work in progress. Let me know which part to explore in more detail;
I'd do it momentarily (as I'm exploring the Parquet data source in more detail
as we speak).
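
If you want to play with ColumnarBatch directly, you can drive the vectorized
reader yourself, the way Spark's own Parquet test suites do. A minimal sketch
(untested here, and partly an assumption on my side: the constructor signature
differs between Spark versions, the file path is just a placeholder, and the
initialize(path, columns) overload is a test-only helper that may not be
accessible outside Spark's own package):

import org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
import org.apache.spark.sql.vectorized.ColumnarBatch

// convertTz = null (no timezone conversion), on-heap memory, 4096 rows per batch
val reader = new VectorizedParquetRecordReader(null, false, 4096)
try {
  // null column list = project all columns; the path is a placeholder
  reader.initialize("/path/to/part-00000.parquet", null)
  reader.enableReturningBatches() // hand back ColumnarBatch instead of rows
  while (reader.nextKeyValue()) {
    val batch = reader.getCurrentValue.asInstanceOf[ColumnarBatch]
    println(s"rows in batch: ${batch.numRows}")
  }
} finally {
  reader.close()
}

Keep in mind the batch (and the column vectors inside it) is reused between
calls, so copy anything you need before calling nextKeyValue() again.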

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

