Posted to dev@parquet.apache.org by Jörg Anders <jr...@ymail.com.INVALID> on 2017/08/07 13:38:33 UTC

why line by line

Hi all!
I have a general question concerning PARQUET.
PARQUET is a columnar store, but the typical Apache PARQUET writer/reader loops use a row-by-row strategy:

Iterator<Valuet> itr = theValues.iterator();
while (itr.hasNext()) {
    writer.write(groupFromValue(itr.next()));
}
writer.close();

Assume I had the columns at hand. This procedure requires converting them into rows. Is there a way to write columns directly? If not, could anybody please explain the contradiction between the columnar nature of PARQUET and the row-by-row read/write strategy?

Is it for technical reasons, perhaps because of some requirements of the record shredding and assembly algorithm?
A URL would suffice.
Thank you in advance
Joerg
 

Re: why line by line

Posted by Wes McKinney <we...@gmail.com>.
hi Joerg,

It sounds like you are referring to the record-based writer API that's
found in parquet-mr, which was originally designed for use in Hadoop
MapReduce (if I understand correctly).

There is no requirement to write Parquet files in this fashion. The
Parquet C++ writer and reader API
(https://github.com/apache/parquet-cpp) is vectorized / column based.
Some systems (like Spark, Dremio, Drill, I believe) have vectorized
Java implementations.

There is interest in creating an Arrow-based columnar reader and
writer API for Java within parquet-mr; that would be a promising
approach.

- Wes
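
For the case raised in the question, where the columns are already in memory and only a record-based writer is available, the pivot into rows can at least be done lazily, one row per call, instead of materializing a second copy of the data. The following is a minimal sketch in plain Java under stated assumptions: ColumnsToRows, RowWriter and rows are hypothetical names chosen for illustration and are not part of parquet-mr, and RowWriter merely stands in for whatever record-based writer is actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ColumnsToRows {

    /** Hypothetical stand-in for a record-based writer; NOT a parquet-mr type. */
    interface RowWriter {
        void write(Object[] row);
    }

    /**
     * Presents equally long columns as an iterator over rows.
     * Each row is assembled only when next() is called.
     */
    static Iterator<Object[]> rows(List<Object[]> columns) {
        final int rowCount = columns.isEmpty() ? 0 : columns.get(0).length;
        for (Object[] column : columns) {
            if (column.length != rowCount) {
                throw new IllegalArgumentException("columns differ in length");
            }
        }
        return new Iterator<Object[]>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < rowCount;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Pick the value at the current position from every column.
                Object[] row = new Object[columns.size()];
                for (int c = 0; c < row.length; c++) {
                    row[c] = columns.get(c)[next];
                }
                next++;
                return row;
            }
        };
    }

    public static void main(String[] args) {
        // Two columns of three values each (illustrative data only).
        List<Object[]> columns = new ArrayList<>();
        columns.add(new Object[] {1, 2, 3});
        columns.add(new Object[] {"a", "b", "c"});

        // The stand-in writer just prints each row it receives.
        RowWriter writer = row -> System.out.println(Arrays.toString(row));

        // Same loop shape as in the question, fed from columnar data.
        Iterator<Object[]> itr = rows(columns);
        while (itr.hasNext()) {
            writer.write(itr.next());
        }
    }
}
```

This only adapts columnar input to a row-oriented interface; it does not avoid the per-record work inside the writer, which is what a vectorized, column-based API is meant to address.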
