Posted to dev@drill.apache.org by Jacques Nadeau <ja...@apache.org> on 2013/12/02 04:31:12 UTC

parquet performance...

Hey all,

I've been working on trying to improve Parquet read performance
(specifically translation performance).  With my latest changes, I get
about 300 MB/s per core for the transformation process.  It looks like this
is dominated by excess copies, allocation, and GC (since Parquet currently
requires the use of byte arrays).  My updates are in my local branch here:
https://github.com/jacques-n/incubator-drill/tree/parquet-updates and are
built on top of Jason's work on adding column selection to the
ParquetRecordReader.
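
To make the copy and GC overhead concrete, here is a minimal Java sketch of
the per-value copy pattern I'm describing.  The class, method, and page
layout (a 4-byte length prefix per value) are hypothetical and only
illustrative; this is not Drill's or Parquet's actual reader code.

import java.nio.ByteBuffer;

public class PerValueCopyExample {
  // Copies each variable-length value through a fresh on-heap byte[] before
  // writing it to the destination -- one allocation per value, which is what
  // drives the allocation and GC overhead described above.
  static void copyValuesOneByOne(ByteBuffer page, int valueCount, ByteBuffer dest) {
    for (int i = 0; i < valueCount; i++) {
      int len = page.getInt();       // hypothetical 4-byte length prefix
      byte[] tmp = new byte[len];    // intermediate heap allocation per value
      page.get(tmp);                 // copy #1: page buffer -> heap array
      dest.putInt(len);
      dest.put(tmp);                 // copy #2: heap array -> destination buffer
    }
  }
}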

In general, I think the two next big wins we can look for are:
- Switching to a Parquet format where we can bulk-copy variable-length
binary fields rather than copying value by value (I believe this is in the
Parquet 2.0 spec); a rough sketch of the idea follows after this list.
- Updating the Parquet readers to work with buffers, to avoid excess copies
and constantly pulling data onto and then off the heap.  (The object churn
is just too high since we're generating multiple objects for every page.)
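
As a rough sketch of the bulk-copy idea in the first bullet (hypothetical
names; this assumes the page stores all value bytes contiguously with
lengths carried separately, which is the property a Parquet 2.0-style
encoding would give us):

import java.nio.ByteBuffer;

public class BulkCopyExample {
  // Transfers the entire variable-length data region of a page with a single
  // bulk put, instead of one length-prefixed copy per value.
  static void bulkCopyValues(ByteBuffer page, int totalValueBytes, ByteBuffer dest) {
    ByteBuffer region = page.slice();   // view over the contiguous value bytes
    region.limit(totalValueBytes);
    dest.put(region);                   // one copy for the whole page
    page.position(page.position() + totalValueBytes);
  }
}

If the destination is a direct buffer (ByteBuffer.allocateDirect), this also
avoids creating per-value heap objects, which is the same churn problem the
second bullet is about.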

The updates I made don't have a huge impact on performance.  The biggest
win was removing the logging message associated with each page read :P.
However, the one big difference with my changes is substantially lower
memory usage.
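
For reference, the pattern I'm talking about is roughly the one below
(SLF4J-style; the logger, message, and fields are made up for illustration
and aren't the actual statement).  Guarding the call with isDebugEnabled()
avoids the formatting and argument boxing when debug is off; removing it
entirely avoids even the level check on the per-page hot path.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PageReadLogging {
  private static final Logger logger = LoggerFactory.getLogger(PageReadLogging.class);

  // Illustrative only: a log statement that runs once per page read.
  void onPageRead(int pageIndex, int compressedBytes) {
    if (logger.isDebugEnabled()) {
      logger.debug("read page {} ({} compressed bytes)", pageIndex, compressedBytes);
    }
    // ... page decompression and decoding would happen here ...
  }
}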

I need to spend some time breaking down performance by field type to see
where we're paying for the slowdown.  I'm hypothesizing above that varlen
is probably the dominant cost, but I will spend more time analyzing.

Jacques