Posted to dev@arrow.apache.org by Adam Hooper <ad...@adamhooper.com> on 2021/07/20 18:00:31 UTC

C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

Hi list,

Updating some code to Arrow 4.0, I noticed
https://issues.apache.org/jira/browse/PARQUET-1899 deprecated
parquet::TypedColumnReader<T>::ReadBatchSpaced().

I use this function in a parquet-to-csv converter. It reads batches of
1,000 values at a time, allowing nulls. ReadBatchSpaced() in a loop is
faster than reading an entire record batch. It's also more RAM-friendly (so
the program costs only a few megabytes, regardless of Parquet file
size). I've spawned hundreds of concurrent parquet-to-csv processes,
streaming to slow clients via Python+ASGI, with response times in the
milliseconds. I commented my findings:
https://github.com/CJWorkbench/parquet-to-arrow/blob/70253c7fdf0fc778e51f50b992c98b16e8864723/src/parquet-to-text-stream.cc#L73
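
The loop is shaped roughly like this -- a simplified sketch for a single
optional DOUBLE column, not the real parquet-to-text-stream.cc code; error
handling and the CSV formatting are left out:

#include <parquet/column_reader.h>
#include <vector>

// Illustration only: read one optional, non-nested DOUBLE column in
// batches of 1,000 values, using the validity bitmap to find nulls.
void StreamDoubleColumn(parquet::DoubleReader* reader) {
  constexpr int64_t kBatchSize = 1000;
  std::vector<int16_t> def_levels(kBatchSize);
  std::vector<double> values(kBatchSize);
  std::vector<uint8_t> valid_bits(kBatchSize / 8 + 1);  // one byte of slack
  while (reader->HasNext()) {
    int64_t levels_read = 0, values_read = 0, null_count = 0;
    reader->ReadBatchSpaced(kBatchSize, def_levels.data(),
                            /*rep_levels=*/nullptr, values.data(),
                            valid_bits.data(), /*valid_bits_offset=*/0,
                            &levels_read, &values_read, &null_count);
    for (int64_t i = 0; i < levels_read; ++i) {
      if (valid_bits[i / 8] & (1 << (i % 8))) {
        // Non-null: format values[i] into the CSV output stream.
      } else {
        // Null: emit an empty CSV cell.
      }
    }
  }
}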

As I understand it, the function is deprecated because it has bugs
concerning nested values. These bugs didn't affect me because I don't use
nested values.

Does the C++ parquet reader support reading a batch of values and their
validity bitmap?

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

Posted by Micah Kornfield <em...@gmail.com>.
If dictionary-encoded data is specifically a concern, we've added new
experimental APIs that should be in the next release and allow
retrieving dictionary data as indexes + dictionaries
(ReadBatchWithDictionary) instead of denormalizing them as ReadBatch does.
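
Roughly, the usage I have in mind -- I'm writing this from memory, so treat
the exact signature as an assumption and check column_reader.h in the
release:

#include <parquet/column_reader.h>
#include <vector>

// Assumed shape of the experimental API: the reader hands back dictionary
// indices plus the dictionary itself, instead of materialized values.
void ScanDictionaryColumn(parquet::ByteArrayReader* reader) {
  std::vector<int16_t> def_levels(1000);
  std::vector<int32_t> indices(1000);
  const parquet::ByteArray* dict = nullptr;
  int32_t dict_len = 0;
  while (reader->HasNext()) {
    int64_t indices_read = 0;
    reader->ReadBatchWithDictionary(1000, def_levels.data(),
                                    /*rep_levels=*/nullptr, indices.data(),
                                    &indices_read, &dict, &dict_len);
    // Each of indices[0..indices_read) refers to an entry in dict[0..dict_len).
  }
}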

-Micah

On Wed, Jul 21, 2021 at 8:02 AM Adam Hooper <ad...@adamhooper.com> wrote:

> Hi Micah,
>
> Thank you for this wonderful description. You've solved my problem exactly.
>
> Responses inline:
>
> > "ReadBatchSpaced() in a loop isfaster than reading an entire record
>> > batch."
>>
>> Could you elaborate on this?  What code path were you using for reading
>> record batches that was slower?
>
>
> I'll elaborate based on my (iffy) memory:
>
> The slow path, as I recall, is converting from dictionary-encoded string
> to string. This decoding is fast in batch, slow otherwise.
>
> With Arrow 0.15/0.16, in prototype phase, I converted Parquet to Arrow
> column chunks before I even began streaming. (I cast DictionaryArray to
> StringArray in this step.) Speed was decent, RAM usage wasn't.
>
> When I upgraded to Arrow 1.0, I tried *not* casting DictionaryArray to
> StringArray. RAM usage improved; but testing with a dictionary-heavy file,
> I saw a 5x slowdown.
>
> Then I discovered ReadBatchSpaced(). I love it (and ReadBatch()) because
> it skips Arrow entirely. In my benchmarks, batch-reading just 30 values at
> a time made my whole program 2x faster than the Arrow 0.16 version, on a
> typical 70MB Parquet file. I could trade RAM vs speed by increasing batch
> size; speed was optimal at size 1,000.
>
> Today I don't have time to benchmark any more approaches -- or even
> verify that the sentences I wrote above are 100% correct.
>
> Did you try adjusting the batch size with
>> ArrowReaderProperties [1] to be ~1000 rows also (by default it is 64 K so
>> I
>> would imagine a higher memory overhead)?  There could also be some other
>> places where memory efficiency could be improved.
>>
>
> I didn't test this. I'm not keen to benchmark Parquet => Arrow => CSV
> because I'm already directly converting Parquet => CSV. I imagine there's
> no win for me to find here.
>
> There are several potential options for the CSV use-case:
>> 1.  The stream-reader API (
>>
>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_reader.h
>> )
>>
>
> This looks like a beautiful API. I won't try it because I expect
> dictionary decoding to be slow.
>
>
>> 2.  Using ReadBatch.  The logic of determining nulls for non-nested data
>> is
>> trivial.  You simply need to compare definition levels returned to the max
>> definition level (
>>
>> https://github.com/apache/arrow/blob/d0de88d8384c7593fac1b1e82b276d4a0d364767/cpp/src/parquet/schema.h#L368
>> ).
>> Any definition level less than the max indicates a null.  This also has
>> the
>> nice side effect of requiring less memory when data is null.
>>
>
> This is perfect for me. Thank you -- I will use this approach.
>
>
>> 3.  Using a record batch reader (
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L179
>> )
>> and the Arrow to CSV writer  (
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.h).
>> The CSV writer code doesn't support all types yet; each type requires a
>> cast-to-string kernel to be available.  If extreme memory efficiency is your
>> aim, this is probably not the best option.  Speed-wise it is probably
>> going
>> to be pretty competitive and will likely see the most improvements for
>> "free" in the long run.
>
>
> Ooh, lovely. Yes, I imagine this can be fastest; but it's not ideal for
> streaming because it's high-RAM and high time-to-first-byte.
>
> Thank you again for your advice. You've been more than helpful.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
>

Re: C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

Posted by Adam Hooper <ad...@adamhooper.com>.
Hi Micah,

Thank you for this wonderful description. You've solved my problem exactly.

Responses inline:

> "ReadBatchSpaced() in a loop isfaster than reading an entire record
> > batch."
>
> Could you elaborate on this?  What code path were you using for reading
> record batches that was slower?


I'll elaborate based on my (iffy) memory:

The slow path, as I recall, is converting from dictionary-encoded string to
string. This decoding is fast in batch, slow otherwise.

With Arrow 0.15/0.16, in prototype phase, I converted Parquet to Arrow
column chunks before I even began streaming. (I cast DictionaryArray to
StringArray in this step.) Speed was decent, RAM usage wasn't.

When I upgraded to Arrow 1.0, I tried *not* casting DictionaryArray to
StringArray. RAM usage improved; but testing with a dictionary-heavy file,
I saw a 5x slowdown.

Then I discovered ReadBatchSpaced(). I love it (and ReadBatch()) because it
skips Arrow entirely. In my benchmarks, batch-reading just 30 values at a
time made my whole program 2x faster than the Arrow 0.16 version, on a
typical 70MB Parquet file. I could trade RAM vs speed by increasing batch
size; speed was optimal at size 1,000.

Today I don't have time to benchmark any more approaches -- or even
verify that the sentences I wrote above are 100% correct.

Did you try adjusting the batch size with
> ArrowReaderProperties [1] to be ~1000 rows also (by default it is 64 K so I
> would imagine a higher memory overhead)?  There could also be some other
> places where memory efficiency could be improved.
>

I didn't test this. I'm not keen to benchmark Parquet => Arrow => CSV
because I'm already directly converting Parquet => CSV. I imagine there's
no win for me to find here.

There are several potential options for the CSV use-case:
> 1.  The stream-reader API (
>
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_reader.h
> )
>

This looks like a beautiful API. I won't try it because I expect dictionary
decoding to be slow.


> 2.  Using ReadBatch.  The logic of determining nulls for non-nested data is
> trivial.  You simply need to compare definition levels returned to the max
> definition level (
>
> https://github.com/apache/arrow/blob/d0de88d8384c7593fac1b1e82b276d4a0d364767/cpp/src/parquet/schema.h#L368
> ).
> Any definition level less than the max indicates a null.  This also has the
> nice side effect of requiring less memory when data is null.
>

This is perfect for me. Thank you -- I will use this approach.


> 3.  Using a record batch reader (
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L179
> )
> and the Arrow to CSV writer  (
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.h).
> The CSV writer code doesn't support all types yet; each type requires a
> cast-to-string kernel to be available.  If extreme memory efficiency is your
> aim, this is probably not the best option.  Speed-wise it is probably going
> to be pretty competitive and will likely see the most improvements for
> "free" in the long run.


Ooh, lovely. Yes, I imagine this can be fastest; but it's not ideal for
streaming because it's high-RAM and high time-to-first-byte.

Thank you again for your advice. You've been more than helpful.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Adam,

> "ReadBatchSpaced() in a loop isfaster than reading an entire record
> batch."


Could you elaborate on this?  What code path were you using for reading
record batches that was slower?  Did you try adjusting the batch size with
ArrowReaderProperties [1] to be ~1000 rows also (by default it is 64 K so I
would imagine a higher memory overhead)?  There could also be some other
places where memory efficiency could be improved.
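
For reference, that adjustment looks roughly like this (untested; the
properties object is then passed in when building the reader, e.g. via
parquet::arrow::FileReaderBuilder):

parquet::ArrowReaderProperties props;
props.set_batch_size(1000);  // default is 64K rows per generated record batch
// then pass props through FileReaderBuilder::properties(props) before Build()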

As I understand it, the function is deprecated because it has bugs
> concerning nested values. These bugs didn't affect me because I don't use
> nested values.


This is correct.  Even if they don't affect you, I think it is dangerous to
keep this API around if it is not maintained and has potential bugs.

Does the C++ parquet reader support reading a batch of values and their
> validity bitmap?


No, but see below for using ReadBatch; reconstructing the null bitmap is
trivial for non-nested data (and probably isn't even necessary if you read
back the definition levels).


There are several potential options for the CSV use-case:
1.  The stream-reader API (
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_reader.h
)

2.  Using ReadBatch.  The logic of determining nulls for non-nested data is
trivial.  You simply need to compare definition levels returned to the max
definition level (
https://github.com/apache/arrow/blob/d0de88d8384c7593fac1b1e82b276d4a0d364767/cpp/src/parquet/schema.h#L368).
Any definition level less than the max indicates a null.  This also has the
nice side effect of requiring less memory when data is null.  (A rough sketch
of this approach follows after option 3.)

3.  Using a record batch reader (
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L179)
and the Arrow to CSV writer  (
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.h).
The CSV writer code doesn't support all types yet; each type requires a
cast-to-string kernel to be available.  If extreme memory efficiency is your
aim, this is probably not the best option.  Speed-wise it is probably going
to be pretty competitive and will likely see the most improvements for
"free" in the long run.

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L571


On Tue, Jul 20, 2021 at 11:07 AM Adam Hooper <ad...@adamhooper.com> wrote:

> Hi list,
>
> Updating some code to Arrow 4.0, I noticed
> https://issues.apache.org/jira/browse/PARQUET-1899 deprecated
> parquet::TypedColumnReader<T>::ReadBatchSpaced().
>
> I use this function in a parquet-to-csv converter. It reads batches of
> 1,000 values at a time, allowing nulls. ReadBatchSpaced() in a loop is
> faster than reading an entire record batch. It's also more RAM-friendly (so
> the program costs only a few megabytes, regardless of Parquet file
> size). I've spawned hundreds of concurrent parquet-to-csv processes,
> streaming to slow clients via Python+ASGI, with response times in the
> milliseconds. I commented my findings:
>
> https://github.com/CJWorkbench/parquet-to-arrow/blob/70253c7fdf0fc778e51f50b992c98b16e8864723/src/parquet-to-text-stream.cc#L73
>
> As I understand it, the function is deprecated because it has bugs
> concerning nested values. These bugs didn't affect me because I don't use
> nested values.
>
> Does the C++ parquet reader support reading a batch of values and their
> validity bitmap?
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
>