Posted to dev@parquet.apache.org by Tomer Solomon <to...@gmail.com> on 2019/02/27 14:32:02 UTC

Parquet-mr - ParquetFileReader IO and memory footprint

Hi everybody,

I'm trying to understand the IO mechanism and memory footprint of the
parquet-mr library.
In particular, I wish to understand what happens when the ParquetFileReader
reads the next row group. For simplicity, I'm interested first in the case
where no filtering is required and we want to read all records in the file
and print them out.

Does the ParquetFileReader load the entire row group into its internal
memory each time? Can it be configured to read the file lazily and in a
fine-grained way: at each step, read only the current page for each column
instead of reading all pages of the column chunks in the row group in
advance? That is, read the first page of each column, process it and
produce the records inside it, and only then read the second page, and so on.

As I understand it, the NextFilteredRowGroup method first figures out all
the metadata and creates a list of ConsecutivePartList objects for all the
chunks we are about to read. After that, it calls readAll for each
consecutive chunk. In the case where I'm reading all columns in the Parquet
file, this ConsecutivePartList would contain all pages of all columns in
the row group, right?
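The grouping step described above can be modeled in plain Java. This is a
simplified sketch of the idea only, not parquet-mr's actual
ConsecutivePartList implementation: column-chunk byte ranges that are
adjacent in the file are merged so each merged range costs a single read.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of grouping adjacent column-chunk byte ranges into
// single consecutive reads (the idea behind ConsecutivePartList).
public class ConsecutiveParts {
    public static class Range {
        public final long offset, length;
        public Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Merge each range that starts exactly where the previous one ends.
    public static List<Range> merge(List<Range> chunks) {
        List<Range> merged = new ArrayList<>();
        for (Range r : chunks) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (last.offset + last.length == r.offset) {
                    merged.set(merged.size() - 1,
                               new Range(last.offset, last.length + r.length));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> chunks = List.of(
            new Range(0, 100), new Range(100, 50),  // adjacent: one read
            new Range(400, 25));                    // gap: separate read
        System.out.println(merge(chunks).size() + " consecutive reads");
    }
}
```

With all columns projected and no gaps between chunks, everything in the
row group collapses into one big consecutive read, which matches the
behavior the question describes.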

Inside the readAll method, ByteBuffers are allocated and readFully is
called on them. From what I understand, parquet-mr uses HeapByteBuffer and
DirectByteBuffer as its ByteBuffer implementations. Neither of them
supports lazy evaluation, so when you read data into them, the data is
actually read right away.
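The eager semantics can be demonstrated with plain NIO. This is a
self-contained sketch using only the JDK, not parquet-mr's actual reader:
a readFully-style loop blocks until the buffer is completely filled, so all
the bytes are materialized in memory before any page is decoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Sketch of a readFully-style helper: it loops until the buffer is full,
// so the bytes are in memory up front (nothing about the read is deferred).
public class ReadFullyDemo {
    public static void readFully(ReadableByteChannel ch, ByteBuffer buf)
            throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new IOException("EOF before buffer was filled");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] rowGroupBytes = new byte[1024];  // stand-in for a row group
        ReadableByteChannel ch =
            Channels.newChannel(new ByteArrayInputStream(rowGroupBytes));
        ByteBuffer buf = ByteBuffer.allocate(rowGroupBytes.length);  // HeapByteBuffer
        readFully(ch, buf);
        System.out.println("bytes in memory: " + buf.position());
    }
}
```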

So, is it possible to configure the ParquetFileReader to read the pages in
the row group lazily, and at each step read only the relevant pages for
each column?

Regards,
Tomer Solomon

Re: Parquet-mr - ParquetFileReader IO and memory footprint

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don’t agree that the issue here is performance. We really haven’t tried
anything other than loading all of the columns at once.

One thing that we have been meaning to do for a long time is to implement
the PageStore interface with an alternative IO manager. Parquet
materialization should request pages and rely on the page store or IO
manager to decide whether to fetch the data lazily or eagerly.

We have a few ideas for strategies in an IO manager:

   - Several small columns and one large column: eagerly load the small
   columns and stream through the large column
   - Few large columns: open multiple streams and read pages in parallel
   - Page-level filtering: push record filters to the IO manager so it can
   avoid reading pages that will not be requested
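The strategies listed above could be sketched as a simple per-row-group
selection policy. Everything below is hypothetical: none of these names,
thresholds, or enums exist in parquet-mr; the code only illustrates how an
IO manager might pick between the three ideas.

```java
import java.util.List;

// Hypothetical sketch of an IO manager's strategy choice per row group,
// driven by column-chunk sizes. Illustrative only; not parquet-mr code.
public class IoManagerSketch {
    public enum Strategy {
        EAGER_SMALL_STREAM_LARGE,  // load small columns, stream the large one
        PARALLEL_STREAMS,          // one stream per large column
        SINGLE_READ                // all small: one consecutive read
    }

    static final long LARGE = 64L * 1024 * 1024;  // 64 MB cutoff (arbitrary)

    public static Strategy choose(List<Long> columnChunkSizes) {
        long large = columnChunkSizes.stream().filter(s -> s >= LARGE).count();
        long small = columnChunkSizes.size() - large;
        if (large == 1 && small > 0) {
            return Strategy.EAGER_SMALL_STREAM_LARGE;
        }
        if (large > 1) {
            return Strategy.PARALLEL_STREAMS;
        }
        return Strategy.SINGLE_READ;
    }

    public static void main(String[] args) {
        // Several small columns plus one 256 MB column.
        System.out.println(choose(List.of(1_000L, 2_000L, 256L * 1024 * 1024)));
    }
}
```

Page-level filtering would then be a refinement inside each strategy:
the record filter tells the IO manager which page ranges to skip entirely.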

rb

On Mon, Mar 4, 2019 at 12:27 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Tomer,
>
> parquet-mr does not support lazy reading currently. The reason is
> performance.
> The pages for one column are written one after another (forming a column
> chunk), and then the pages for the other columns follow in the same way.
> This means that if you wanted to keep only one page per column in memory,
> it would require many seeks in the file to position the reader at the next
> page. It is much faster to read the consecutive parts in one read, so you
> have much less IO.
>
> Meanwhile, I understand that this requires much more memory than the lazy
> reading you've suggested. It might be a good improvement for parquet-mr to
> have switchable lazy reading, and it would also be interesting to have
> some benchmarks comparing the two.
>
> Regards,
> Gabor
>
> On Fri, Mar 1, 2019 at 8:11 PM Tomer Solomon <to...@gmail.com>
> wrote:


-- 
Ryan Blue
Software Engineer
Netflix

Re: Parquet-mr - ParquetFileReader IO and memory footprint

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Tomer,

parquet-mr does not support lazy reading currently. The reason is
performance.
The pages for one column are written one after another (forming a column
chunk), and then the pages for the other columns follow in the same way.
This means that if you wanted to keep only one page per column in memory,
it would require many seeks in the file to position the reader at the next
page. It is much faster to read the consecutive parts in one read, so you
have much less IO.
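The trade-off can be made concrete with back-of-the-envelope arithmetic.
This is an illustrative model only; real costs depend on the storage layer:
with C columns and P pages per column chunk, page-by-page lazy reading
costs roughly C x P repositionings per row group, versus one per
consecutive part for the eager approach.

```java
// Rough model of the seek cost described above (illustrative numbers only;
// actual costs depend on the file system or object store).
public class SeekCost {
    // Lazy reading positions the stream once per page, per column.
    public static long lazySeeks(int columns, int pagesPerChunk) {
        return (long) columns * pagesPerChunk;
    }

    // Eager reading does one positioned read per consecutive part
    // (a single part when all requested chunks are adjacent in the file).
    public static long eagerSeeks(int consecutiveParts) {
        return consecutiveParts;
    }

    public static void main(String[] args) {
        System.out.println("lazy:  " + lazySeeks(50, 20) + " seeks per row group");
        System.out.println("eager: " + eagerSeeks(1) + " seek per row group");
    }
}
```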

Meanwhile, I understand that this requires much more memory than the lazy
reading you've suggested. It might be a good improvement for parquet-mr to
have switchable lazy reading, and it would also be interesting to have some
benchmarks comparing the two.

Regards,
Gabor
