Posted to dev@iceberg.apache.org by Mike Zhang <mi...@gmail.com> on 2022/02/05 05:00:51 UTC

Iceberg reading Parquet files to Arrow format

I am reading the Iceberg code for the Parquet read path and see that
Parquet files are read into Arrow format first. I wonder how much
performance gain we get by doing that. Let's take the example of a
Spark application with Iceberg. If the Parquet file were read directly into
Spark RDD records, shouldn't it be faster than Parquet->Arrow->Spark
record? Since Iceberg converts to Arrow first today, there must be
some benefit to that, so I feel I am missing something. Can somebody help
explain?

Re: Iceberg reading Parquet files to Arrow format

Posted by ru...@gmail.com.
I'm not sure what you are saying; for our implementation, vectorization is the Arrow format. That's how we pass batches to Spark in vectorized mode. They cannot be separated in the Iceberg code, although I guess you could implement another columnar in-memory format by extending Spark's ColumnarBatch.
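For illustration only, here is a minimal sketch of what "another columnar in-memory format" could look like. Spark's extension point is per column: you implement a ColumnVector over your own buffers, and the ColumnarBatch container stays the same. The class name and the int-array backing below are made up for the example, not Iceberg code, and only the accessors an int column needs are filled in.

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Decimal;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarArray;
    import org.apache.spark.sql.vectorized.ColumnarMap;
    import org.apache.spark.unsafe.types.UTF8String;

    // Hypothetical column vector backed by a plain int[] instead of Arrow buffers.
    public class IntArrayColumnVector extends ColumnVector {
      private final int[] values;

      public IntArrayColumnVector(int[] values) {
        super(DataTypes.IntegerType);
        this.values = values;
      }

      @Override public int getInt(int rowId) { return values[rowId]; }
      @Override public boolean hasNull() { return false; }
      @Override public int numNulls() { return 0; }
      @Override public boolean isNullAt(int rowId) { return false; }
      @Override public void close() { }

      // Remaining accessors are not used for a non-null int column.
      @Override public boolean getBoolean(int rowId) { throw new UnsupportedOperationException(); }
      @Override public byte getByte(int rowId) { throw new UnsupportedOperationException(); }
      @Override public short getShort(int rowId) { throw new UnsupportedOperationException(); }
      @Override public long getLong(int rowId) { throw new UnsupportedOperationException(); }
      @Override public float getFloat(int rowId) { throw new UnsupportedOperationException(); }
      @Override public double getDouble(int rowId) { throw new UnsupportedOperationException(); }
      @Override public Decimal getDecimal(int rowId, int precision, int scale) { throw new UnsupportedOperationException(); }
      @Override public UTF8String getUTF8String(int rowId) { throw new UnsupportedOperationException(); }
      @Override public byte[] getBinary(int rowId) { throw new UnsupportedOperationException(); }
      @Override public ColumnarArray getArray(int rowId) { throw new UnsupportedOperationException(); }
      @Override public ColumnarMap getMap(int ordinal) { throw new UnsupportedOperationException(); }
      @Override public ColumnVector getChild(int ordinal) { throw new UnsupportedOperationException(); }
    }

Instances of such a class would be wrapped in a ColumnarBatch exactly the same way Iceberg wraps its ArrowColumnVectors today.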

Sent from my iPhone

> On Feb 5, 2022, at 7:06 PM, Mike Zhang <mi...@gmail.com> wrote:
> 
> 
> Thanks Russell! I wonder if the performance gain is mainly from vectorization rather than from using the Arrow format? My understanding of the benefit of using Arrow is avoiding serialization/deserialization. I just have a hard time understanding how Iceberg uses Arrow to get that benefit.
> 
>> On Sat, Feb 5, 2022 at 5:39 AM Russell Spitzer <ru...@gmail.com> wrote:
>> One thing to note is that we never really go to "RDD" records, since we are always working with the DataFrame API. Spark builds RDDs but expects us to deliver data in one of two ways: row-based (InternalRows) or columnar (Arrow vectors). Columnar reads are generally more efficient and parallelizable; usually when someone talks about vectorizing Parquet reads they mean columnar reads.
>> 
>> While this gives us much better performance (see our various perf test modules in the code base if you would like to run them yourself), Spark is still a row-oriented engine. Spark wants to take advantage of this format, which is why it provides the ColumnarBatch interface, but it still does all codegen and other operations on a per-row basis. This means that although we can generally load the data much faster than with row-based loading, Spark still has to work on the data in row format most of the time. There are a variety of projects working to fix this as well.
>> 
>>> On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mi...@gmail.com> wrote:
>>> I am reading the Iceberg code for the Parquet read path and see that Parquet files are read into Arrow format first. I wonder how much performance gain we get by doing that. Let's take the example of a Spark application with Iceberg. If the Parquet file were read directly into Spark RDD records, shouldn't it be faster than Parquet->Arrow->Spark record? Since Iceberg converts to Arrow first today, there must be some benefit to that, so I feel I am missing something. Can somebody help explain?

Re: Iceberg reading Parquet files to Arrow format

Posted by Mike Zhang <mi...@gmail.com>.
Thanks Russell! I wonder if the performance gain is mainly from
vectorization rather than from using the Arrow format? My understanding of
the benefit of using Arrow is avoiding serialization/deserialization. I
just have a hard time understanding how Iceberg uses Arrow to get that
benefit.

On Sat, Feb 5, 2022 at 5:39 AM Russell Spitzer <ru...@gmail.com>
wrote:

> One thing to note is that we never really go to "RDD" records, since we are
> always working with the DataFrame API. Spark builds RDDs but expects us to
> deliver data in one of two ways: row-based
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262> (InternalRows)
> or columnar
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267> (Arrow vectors).
> Columnar reads are generally more efficient and parallelizable; usually
> when someone talks about *vectorizing Parquet reads* they mean columnar
> reads.
>
> While this gives us much better performance (see our various perf test
> modules in the code base if you would like to run them yourself), Spark is
> still a row-oriented engine. Spark wants to take advantage of this format,
> which is why it provides the ColumnarBatch interface, but it still does all
> codegen and other operations on a per-row basis. This means that although
> we can generally load the data much faster than with row-based loading,
> Spark still has to work on the data in row format most of the time. There
> are a variety of projects working to fix this as well.
>
> On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mi...@gmail.com>
> wrote:
>
>> I am reading the Iceberg code for the Parquet read path and see that
>> Parquet files are read into Arrow format first. I wonder how much
>> performance gain we get by doing that. Let's take the example of a Spark
>> application with Iceberg. If the Parquet file were read directly into
>> Spark RDD records, shouldn't it be faster than Parquet->Arrow->Spark
>> record? Since Iceberg converts to Arrow first today, there must be some
>> benefit to that, so I feel I am missing something. Can somebody help
>> explain?
>>
>

Re: Iceberg reading Parquet files to Arrow format

Posted by Russell Spitzer <ru...@gmail.com>.
One thing to note is that we never really go to "RDD" records, since we are
always working with the DataFrame API. Spark builds RDDs but expects us to
deliver data in one of two ways: row-based
<https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262>
(InternalRows)
or columnar
<https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267>
(Arrow vectors).
Columnar reads are generally more efficient and parallelizable; usually
when someone talks about *vectorizing Parquet reads* they mean columnar
reads.

While this gives us much better performance (see our various perf test
modules in the code base if you would like to run them yourself), Spark is
still a row-oriented engine. Spark wants to take advantage of this format,
which is why it provides the ColumnarBatch interface, but it still does all
codegen and other operations on a per-row basis. This means that although we
can generally load the data much faster than with row-based loading, Spark
still has to work on the data in row format most of the time. There are a
variety of projects working to fix this as well.
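A minimal, self-contained sketch of that hand-off, for illustration only (not Iceberg's actual reader code; the class name and data are made up): an Arrow vector is wrapped in Spark's ArrowColumnVector, exposed as a ColumnarBatch, and then consumed row by row via rowIterator(), which is how Spark's row-oriented operators ultimately see it.

    import java.util.Iterator;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ArrowBatchSketch {
      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator();
             IntVector ids = new IntVector("id", allocator)) {
          // Fill a single Arrow column, the way a vectorized Parquet reader would.
          int rows = 3;
          ids.allocateNew(rows);
          for (int i = 0; i < rows; i++) {
            ids.set(i, i * 10);
          }
          ids.setValueCount(rows);

          // Wrap the Arrow buffers for Spark; ArrowColumnVector reads them in
          // place, so no per-value copy or serialization happens at this step.
          ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(ids) };
          ColumnarBatch batch = new ColumnarBatch(columns);
          batch.setNumRows(rows);

          // Spark's row-oriented operators still walk the batch one row at a time.
          Iterator<InternalRow> it = batch.rowIterator();
          while (it.hasNext()) {
            InternalRow row = it.next();
            System.out.println(row.getInt(0));
          }
        }
      }
    }

The columnar read path wins by filling whole columns at once and skipping the per-record object creation of a row-based reader, even though the engine above it still iterates rows.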

On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mi...@gmail.com>
wrote:

> I am reading the Iceberg code for the Parquet read path and see that
> Parquet files are read into Arrow format first. I wonder how much
> performance gain we get by doing that. Let's take the example of a Spark
> application with Iceberg. If the Parquet file were read directly into
> Spark RDD records, shouldn't it be faster than Parquet->Arrow->Spark
> record? Since Iceberg converts to Arrow first today, there must be some
> benefit to that, so I feel I am missing something. Can somebody help
> explain?
>