Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2019/05/23 13:58:00 UTC

Re: Java/Scala: efficient reading of Parquet into Arrow?

hi Joris,

The Apache Parquet mailing list is

dev@parquet.apache.org

I'm copying the list here

AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
otherwise). There are some vectorized Java-based readers in the wild:
in Dremio [1] and Apache Spark, at least. I'm interested to see a
reusable library that supports vectorized Arrow reads in Java.

- Wes

[1]: https://github.com/dremio/dremio-oss
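
For concreteness, "vectorized reading" here means decoding a column chunk into an Arrow vector in one tight loop, rather than materializing a record object per row. A minimal sketch of the Arrow side in Scala, using the Arrow Java vector API, with a `values` array standing in for an already-decoded Parquet column:

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector

val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("x", allocator)

// Stand-in for one decoded Parquet column chunk.
val values: Array[Long] = Array(1L, 2L, 3L)

vector.allocateNew(values.length)
var i = 0
while (i < values.length) { // one pass per column, no per-row objects
  vector.setSafe(i, values(i))
  i += 1
}
vector.setValueCount(values.length)
```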

On Thu, May 23, 2019 at 8:54 AM Joris Peeters
<jo...@gmail.com> wrote:
>
> Hello,
>
> I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
> I'm wondering what the most efficient approach is, especially for the
> reading part. I'm aware that Parquet reading is perhaps beyond the scope of
> this mailing list but,
>
> - I believe Arrow and Parquet are closely intertwined these days?
> - I can't find an appropriate Parquet mailing list.
>
> Any pointers would be appreciated!
>
> Below is the code I currently have. My concern is that this alone already
> takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> various types, and takes 15MB on disk.
>
> import java.nio.file.{Path, Paths}
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> import org.apache.parquet.format.converter.ParquetMetadataConverter
> import org.apache.parquet.hadoop.ParquetFileReader
> import org.apache.parquet.io.ColumnIOFactory
>
> ...
>
> val path: Path = Paths.get("C:\\item.pq")
> val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> val conf = new Configuration()
>
> val readFooter = ParquetFileReader.readFooter(conf, jpath,
> ParquetMetadataConverter.NO_FILTER)
> val schema = readFooter.getFileMetaData.getSchema
> val r = ParquetFileReader.open(conf, jpath)
>
> val pages = r.readNextRowGroup()
> val rows = pages.getRowCount
>
> val columnIO = new ColumnIOFactory().getColumnIO(schema)
> val recordReader = columnIO.getRecordReader(pages, new
> GroupRecordConverter(schema))
>
> // This takes about 2s
> (1 to rows.toInt).map { i =>
>   val group = recordReader.read
>   // Just read first column for now ...
>   val x = group.getLong(0,0)
> }
>
> ...
>
> As this will be in the hot path of my code, I'm quite keen to make it
> as fast as possible. Note that the eventual objective is to build
> Arrow data. I was assuming there would be a way to quickly load the
> columns. I suspect the loop over the rows, building row-based records,
> is causing a lot of overhead, but can't seem to find another way.
>
>
> Thanks,
>
> -J
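
A lower-level route that avoids the per-row Group materialization above is to pull one column chunk through parquet-mr's ColumnReader and copy it straight into an Arrow vector. A sketch in Scala, under the assumptions that this is parquet-mr ~1.10, that the first column is a required (non-nullable) INT64, and that `readFirstColumnToArrow` is an illustrative name rather than an existing API:

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector
import org.apache.parquet.column.impl.ColumnReadStoreImpl
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.hadoop.ParquetFileReader

def readFirstColumnToArrow(r: ParquetFileReader): BigIntVector = {
  val schema = r.getFooter.getFileMetaData.getSchema
  val pages  = r.readNextRowGroup()                 // one row group at a time
  val desc   = schema.getColumns.get(0)             // first column, as above
  val store  = new ColumnReadStoreImpl(
    pages, new GroupRecordConverter(schema).getRootConverter, schema,
    r.getFooter.getFileMetaData.getCreatedBy)
  val col = store.getColumnReader(desc)

  val n   = col.getTotalValueCount.toInt
  val vec = new BigIntVector("col0", new RootAllocator(Long.MaxValue))
  vec.allocateNew(n)
  var i = 0
  while (i < n) {
    vec.setSafe(i, col.getLong())                   // required INT64 assumed:
    col.consume()                                   // no definition-level check
    i += 1
  }
  vec.setValueCount(n)
  vec
}
```

Whether this beats the Group-based loop in practice would need measuring, but it removes the per-row object construction that the 2s figure suggests is the bottleneck.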

Re: Java/Scala: efficient reading of Parquet into Arrow?

Posted by Masayuki Takahashi <ma...@gmail.com>.
I also think parquet-mr should support vectorized reading.

Is implementing it from scratch the only option?
Or should we take over the existing Vectorized Reader issue [1]?

[1]: https://issues.apache.org/jira/browse/PARQUET-131

On Fri, May 24, 2019 at 2:44 Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> The Iceberg community is probably going to work on this soon. You can
> follow https://github.com/apache/incubator-iceberg/issues/9 for progress.
>
> On Thu, May 23, 2019 at 7:07 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Joris,
> >
> > The Apache Parquet mailing list is
> >
> > dev@parquet.apache.org
> >
> > I'm copying the list here
> >
> > AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> > otherwise). There are some vectorized Java-based readers in the wild:
> > in Dremio [1] and Apache Spark, at least. I'm interested to see a
> > reusable library that supports vectorized Arrow reads in Java.
> >
> > - Wes
> >
> > [1]: https://github.com/dremio/dremio-oss
> >
> > On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> > <jo...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to read a Parquet file from disk into Arrow in memory, in
> > Scala.
> > > I'm wondering what the most efficient approach is, especially for the
> > > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> > of
> > > this mailing list but,
> > >
> > > - I believe Arrow and Parquet are closely intertwined these days?
> > > - I can't find an appropriate Parquet mailing list.
> > >
> > > Any pointers would be appreciated!
> > >
> > > Below is the code I currently have. My concern is that this alone already
> > > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > > various types, and takes 15MB on disk.
> > >
> > > import org.apache.hadoop.conf.Configuration
> > > import org.apache.parquet.column.ColumnDescriptor
> > > import
> > org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > > import org.apache.parquet.hadoop.{ParquetFileReader}
> > > import org.apache.parquet.io.ColumnIOFactory
> > >
> > > ...
> > >
> > > val path: Path = Paths.get("C:\\item.pq")
> > > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > > val conf = new Configuration()
> > >
> > > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > > ParquetMetadataConverter.NO_FILTER)
> > > val schema = readFooter.getFileMetaData.getSchema
> > > val r = ParquetFileReader.open(conf, jpath)
> > >
> > > val pages = r.readNextRowGroup()
> > > val rows = pages.getRowCount
> > >
> > > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > > val recordReader = columnIO.getRecordReader(pages, new
> > > GroupRecordConverter(schema))
> > >
> > > // This takes about 2s
> > > (1 to rows.toInt).map { i =>
> > >   val group = recordReader.read
> > >   // Just read first column for now ...
> > >   val x = group.getLong(0,0)
> > > }
> > >
> > > ...
> > >
> > > As this will be in the hot path of my code, I'm quite keen to make it
> > > as fast as possible. Note that the eventual objective is to build
> > > Arrow data. I was assuming there would be a way to quickly load the
> > > columns. I suspect the loop over the rows, building row-based records,
> > > is causing a lot of overhead, but can't seem to find another way.
> > >
> > >
> > > Thanks,
> > >
> > > -J
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



-- 
Masayuki Takahashi

Re: Java/Scala: efficient reading of Parquet into Arrow?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
The Iceberg community is probably going to work on this soon. You can
follow https://github.com/apache/incubator-iceberg/issues/9 for progress.

On Thu, May 23, 2019 at 7:07 AM Wes McKinney <we...@gmail.com> wrote:

> hi Joris,
>
> The Apache Parquet mailing list is
>
> dev@parquet.apache.org
>
> I'm copying the list here
>
> AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> otherwise). There are some vectorized Java-based readers in the wild:
> in Dremio [1] and Apache Spark, at least. I'm interested to see a
> reusable library that supports vectorized Arrow reads in Java.
>
> - Wes
>
> [1]: https://github.com/dremio/dremio-oss
>
> On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> <jo...@gmail.com> wrote:
> >
> > Hello,
> >
> > I'm trying to read a Parquet file from disk into Arrow in memory, in
> Scala.
> > I'm wondering what the most efficient approach is, especially for the
> > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> of
> > this mailing list but,
> >
> > - I believe Arrow and Parquet are closely intertwined these days?
> > - I can't find an appropriate Parquet mailing list.
> >
> > Any pointers would be appreciated!
> >
> > Below is the code I currently have. My concern is that this alone already
> > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > various types, and takes 15MB on disk.
> >
> > import org.apache.hadoop.conf.Configuration
> > import org.apache.parquet.column.ColumnDescriptor
> > import
> org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > import org.apache.parquet.hadoop.{ParquetFileReader}
> > import org.apache.parquet.io.ColumnIOFactory
> >
> > ...
> >
> > val path: Path = Paths.get("C:\\item.pq")
> > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > val conf = new Configuration()
> >
> > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > ParquetMetadataConverter.NO_FILTER)
> > val schema = readFooter.getFileMetaData.getSchema
> > val r = ParquetFileReader.open(conf, jpath)
> >
> > val pages = r.readNextRowGroup()
> > val rows = pages.getRowCount
> >
> > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > val recordReader = columnIO.getRecordReader(pages, new
> > GroupRecordConverter(schema))
> >
> > // This takes about 2s
> > (1 to rows.toInt).map { i =>
> >   val group = recordReader.read
> >   // Just read first column for now ...
> >   val x = group.getLong(0,0)
> > }
> >
> > ...
> >
> > As this will be in the hot path of my code, I'm quite keen to make it
> > as fast as possible. Note that the eventual objective is to build
> > Arrow data. I was assuming there would be a way to quickly load the
> > columns. I suspect the loop over the rows, building row-based records,
> > is causing a lot of overhead, but can't seem to find another way.
> >
> >
> > Thanks,
> >
> > -J
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Java/Scala: efficient reading of Parquet into Arrow?

Posted by Arvin <ar...@gmail.com>.
As I understand it, Parquet supports reading only a subset of the columns; I use
code like the snippet below. I believe the reader fetches only the pages of the
requested columns within each row group, based on the footer metadata. If the
reader pulled in all of a block's data, the metadata would serve no purpose and
the filtering would have to happen on the reader's side, which would be wasteful.
Do you mean that the reader reads all data in a block when you say "parquet-mr
doesn't feature vectorized reading"? Thanks a lot.

Arvin

``` partial code

// Build a projection schema containing only the fields to read.
Schema projectedSchema = Schema.createRecord(writeSchema.getName(),
    writeSchema.getDoc(), writeSchema.getNamespace(), writeSchema.isError());
projectedSchema.setFields(readFields);
log.info("read fields are {}", readFields);

// Push the projection down to parquet-avro so only those columns are read.
configuration.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION,
    projectedSchema.toString());

InputFile inputFile = HadoopInputFile.fromPath(path, configuration);
ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(inputFile)
        .build();

```
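
To complete the picture, a possible read loop once the projection is set, sketched in Scala and assuming `conf` already carries AVRO_REQUESTED_PROJECTION as above and `jpath` is the Hadoop path from Joris's snippet:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val inputFile = HadoopInputFile.fromPath(jpath, conf)
val reader = AvroParquetReader.builder[GenericRecord](inputFile)
  .withConf(conf) // carries the requested projection
  .build()
try {
  var record: GenericRecord = reader.read()
  while (record != null) {  // read() returns null at end of input
    // only the projected columns are decoded from the file
    record = reader.read()
  }
} finally reader.close()
```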

On Thu, May 23, 2019 at 11:49 PM Wes McKinney <we...@gmail.com> wrote:

> Cool. At some point we are interested in having simple compressed
> (e.g. with LZ4 or ZSTD) record batches natively in the Arrow protocol,
> see
>
> https://issues.apache.org/jira/browse/ARROW-300
>
> On Thu, May 23, 2019 at 10:21 AM Joris Peeters
> <jo...@gmail.com> wrote:
> >
> > Cool, thanks. I think we'll just go with reading LZ4 compressed Arrow
> > directly from disk then, and by-pass Parquet altogether.
> > The compressed Arrow files are about 20% larger than the PQ files, but
> > getting it into some useful form in memory is almost on par with pandas.
> >
> > At the moment, I don't need the additional benefits parquet would give
> > (various forms of filtering), so this is perfectly OK. And now I don't
> have
> > half of Hadoop in my dependencies. \o/
> > But yes, definitely +1 on a self-contained vectorised Parquet reader.
> >
> > -J
> >
> >
> > On Thu, May 23, 2019 at 2:58 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Joris,
> > >
> > > The Apache Parquet mailing list is
> > >
> > > dev@parquet.apache.org
> > >
> > > I'm copying the list here
> > >
> > > AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> > > otherwise). There are some vectorized Java-based readers in the wild:
> > > in Dremio [1] and Apache Spark, at least. I'm interested to see a
> > > reusable library that supports vectorized Arrow reads in Java.
> > >
> > > - Wes
> > >
> > > [1]: https://github.com/dremio/dremio-oss
> > >
> > > On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> > > <jo...@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to read a Parquet file from disk into Arrow in memory, in
> > > Scala.
> > > > I'm wondering what the most efficient approach is, especially for the
> > > > reading part. I'm aware that Parquet reading is perhaps beyond the
> scope
> > > of
> > > > this mailing list but,
> > > >
> > > > - I believe Arrow and Parquet are closely intertwined these days?
> > > > - I can't find an appropriate Parquet mailing list.
> > > >
> > > > Any pointers would be appreciated!
> > > >
> > > > Below is the code I currently have. My concern is that this alone
> already
> > > > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()"
> takes
> > > > ~=100ms in Python. So I suspect I'm not doing this in the most
> efficient
> > > > way possible ... The Parquet data holds 1570150 rows, with 14
> columns of
> > > > various types, and takes 15MB on disk.
> > > >
> > > > import org.apache.hadoop.conf.Configuration
> > > > import org.apache.parquet.column.ColumnDescriptor
> > > > import
> > > org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > > > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > > > import org.apache.parquet.hadoop.{ParquetFileReader}
> > > > import org.apache.parquet.io.ColumnIOFactory
> > > >
> > > > ...
> > > >
> > > > val path: Path = Paths.get("C:\\item.pq")
> > > > val jpath = new
> org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > > > val conf = new Configuration()
> > > >
> > > > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > > > ParquetMetadataConverter.NO_FILTER)
> > > > val schema = readFooter.getFileMetaData.getSchema
> > > > val r = ParquetFileReader.open(conf, jpath)
> > > >
> > > > val pages = r.readNextRowGroup()
> > > > val rows = pages.getRowCount
> > > >
> > > > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > > > val recordReader = columnIO.getRecordReader(pages, new
> > > > GroupRecordConverter(schema))
> > > >
> > > > // This takes about 2s
> > > > (1 to rows.toInt).map { i =>
> > > >   val group = recordReader.read
> > > >   // Just read first column for now ...
> > > >   val x = group.getLong(0,0)
> > > > }
> > > >
> > > > ...
> > > >
> > > > As this will be in the hot path of my code, I'm quite keen to make it
> > > > as fast as possible. Note that the eventual objective is to build
> > > > Arrow data. I was assuming there would be a way to quickly load the
> > > > columns. I suspect the loop over the rows, building row-based
> records,
> > > > is causing a lot of overhead, but can't seem to find another way.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > -J
> > >
>

Re: Java/Scala: efficient reading of Parquet into Arrow?

Posted by Wes McKinney <we...@gmail.com>.
Cool. At some point we are interested in having simple compressed
(e.g. with LZ4 or ZSTD) record batches natively in the Arrow protocol,
see

https://issues.apache.org/jira/browse/ARROW-300

On Thu, May 23, 2019 at 10:21 AM Joris Peeters
<jo...@gmail.com> wrote:
>
> Cool, thanks. I think we'll just go with reading LZ4 compressed Arrow
> directly from disk then, and by-pass Parquet altogether.
> The compressed Arrow files are about 20% larger than the PQ files, but
> getting it into some useful form in memory is almost on par with pandas.
>
> At the moment, I don't need the additional benefits parquet would give
> (various forms of filtering), so this is perfectly OK. And now I don't have
> half of Hadoop in my dependencies. \o/
> But yes, definitely +1 on a self-contained vectorised Parquet reader.
>
> -J
>
>
> On Thu, May 23, 2019 at 2:58 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Joris,
> >
> > The Apache Parquet mailing list is
> >
> > dev@parquet.apache.org
> >
> > I'm copying the list here
> >
> > AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> > otherwise). There are some vectorized Java-based readers in the wild:
> > in Dremio [1] and Apache Spark, at least. I'm interested to see a
> > reusable library that supports vectorized Arrow reads in Java.
> >
> > - Wes
> >
> > [1]: https://github.com/dremio/dremio-oss
> >
> > On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> > <jo...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to read a Parquet file from disk into Arrow in memory, in
> > Scala.
> > > I'm wondering what the most efficient approach is, especially for the
> > > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> > of
> > > this mailing list but,
> > >
> > > - I believe Arrow and Parquet are closely intertwined these days?
> > > - I can't find an appropriate Parquet mailing list.
> > >
> > > Any pointers would be appreciated!
> > >
> > > Below is the code I currently have. My concern is that this alone already
> > > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > > various types, and takes 15MB on disk.
> > >
> > > import org.apache.hadoop.conf.Configuration
> > > import org.apache.parquet.column.ColumnDescriptor
> > > import
> > org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > > import org.apache.parquet.hadoop.{ParquetFileReader}
> > > import org.apache.parquet.io.ColumnIOFactory
> > >
> > > ...
> > >
> > > val path: Path = Paths.get("C:\\item.pq")
> > > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > > val conf = new Configuration()
> > >
> > > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > > ParquetMetadataConverter.NO_FILTER)
> > > val schema = readFooter.getFileMetaData.getSchema
> > > val r = ParquetFileReader.open(conf, jpath)
> > >
> > > val pages = r.readNextRowGroup()
> > > val rows = pages.getRowCount
> > >
> > > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > > val recordReader = columnIO.getRecordReader(pages, new
> > > GroupRecordConverter(schema))
> > >
> > > // This takes about 2s
> > > (1 to rows.toInt).map { i =>
> > >   val group = recordReader.read
> > >   // Just read first column for now ...
> > >   val x = group.getLong(0,0)
> > > }
> > >
> > > ...
> > >
> > > As this will be in the hot path of my code, I'm quite keen to make it
> > > as fast as possible. Note that the eventual objective is to build
> > > Arrow data. I was assuming there would be a way to quickly load the
> > > columns. I suspect the loop over the rows, building row-based records,
> > > is causing a lot of overhead, but can't seem to find another way.
> > >
> > >
> > > Thanks,
> > >
> > > -J
> >

Re: Java/Scala: efficient reading of Parquet into Arrow?

Posted by Joris Peeters <jo...@gmail.com>.
Cool, thanks. I think we'll just go with reading LZ4 compressed Arrow
directly from disk then, and by-pass Parquet altogether.
The compressed Arrow files are about 20% larger than the PQ files, but
getting it into some useful form in memory is almost on par with pandas.

At the moment, I don't need the additional benefits parquet would give
(various forms of filtering), so this is perfectly OK. And now I don't have
half of Hadoop in my dependencies. \o/
But yes, definitely +1 on a self-contained vectorised Parquet reader.

-J
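
For reference, one way the read side of this LZ4-plus-Arrow route can look, assuming the file was written as an Arrow IPC stream through lz4-java's LZ4BlockOutputStream (Arrow's Java IPC did not compress buffers itself at this time) and that "item.arrow.lz4" is an illustrative filename:

```scala
import java.io.FileInputStream
import net.jpountz.lz4.LZ4BlockInputStream
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowStreamReader

val allocator = new RootAllocator(Long.MaxValue)
val in = new LZ4BlockInputStream(new FileInputStream("item.arrow.lz4"))
val reader = new ArrowStreamReader(in, allocator)
try {
  val root = reader.getVectorSchemaRoot  // schema plus one vector per column
  while (reader.loadNextBatch()) {       // false once the stream is exhausted
    val rows = root.getRowCount          // vectors now hold the current batch
  }
} finally reader.close()
```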


On Thu, May 23, 2019 at 2:58 PM Wes McKinney <we...@gmail.com> wrote:

> hi Joris,
>
> The Apache Parquet mailing list is
>
> dev@parquet.apache.org
>
> I'm copying the list here
>
> AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> otherwise). There are some vectorized Java-based readers in the wild:
> in Dremio [1] and Apache Spark, at least. I'm interested to see a
> reusable library that supports vectorized Arrow reads in Java.
>
> - Wes
>
> [1]: https://github.com/dremio/dremio-oss
>
> On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> <jo...@gmail.com> wrote:
> >
> > Hello,
> >
> > I'm trying to read a Parquet file from disk into Arrow in memory, in
> Scala.
> > I'm wondering what the most efficient approach is, especially for the
> > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> of
> > this mailing list but,
> >
> > - I believe Arrow and Parquet are closely intertwined these days?
> > - I can't find an appropriate Parquet mailing list.
> >
> > Any pointers would be appreciated!
> >
> > Below is the code I currently have. My concern is that this alone already
> > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > various types, and takes 15MB on disk.
> >
> > import org.apache.hadoop.conf.Configuration
> > import org.apache.parquet.column.ColumnDescriptor
> > import
> org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > import org.apache.parquet.hadoop.{ParquetFileReader}
> > import org.apache.parquet.io.ColumnIOFactory
> >
> > ...
> >
> > val path: Path = Paths.get("C:\\item.pq")
> > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > val conf = new Configuration()
> >
> > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > ParquetMetadataConverter.NO_FILTER)
> > val schema = readFooter.getFileMetaData.getSchema
> > val r = ParquetFileReader.open(conf, jpath)
> >
> > val pages = r.readNextRowGroup()
> > val rows = pages.getRowCount
> >
> > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > val recordReader = columnIO.getRecordReader(pages, new
> > GroupRecordConverter(schema))
> >
> > // This takes about 2s
> > (1 to rows.toInt).map { i =>
> >   val group = recordReader.read
> >   // Just read first column for now ...
> >   val x = group.getLong(0,0)
> > }
> >
> > ...
> >
> > As this will be in the hot path of my code, I'm quite keen to make it
> > as fast as possible. Note that the eventual objective is to build
> > Arrow data. I was assuming there would be a way to quickly load the
> > columns. I suspect the loop over the rows, building row-based records,
> > is causing a lot of overhead, but can't seem to find another way.
> >
> >
> > Thanks,
> >
> > -J
>
