Posted to dev@parquet.apache.org by Andy Grove <An...@rms.com> on 2018/07/23 20:03:23 UTC

Reading arrays from Java using ParquetFileReader

I’m using ParquetFileReader/ParquetPageReader to scan Parquet files and apply a projection. This is working well for primitive column types, but I’m running into an issue when trying to add support for arrays and could use some help.

I’m retrieving the schema like this:

  val r = new ParquetFileReader(file, options)
  val schema: MessageType = r.getFileMetaData.getSchema

I’m then filtering the schema on column name to get the column descriptors.

Let’s say the field I am looking for is “foo”. In the case of an array, I get a descriptor with the path { “foo” / “list” / “element” }.
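
As a quick illustration (this snippet is mine, not from the original message), the column descriptors expose that full path, so listing them makes the standard three-level LIST layout visible:

  import scala.collection.JavaConverters._

  // Each ColumnDescriptor path is the full path to a leaf column, e.g.
  // foo / list / element for an array field named "foo".
  schema.getColumns.asScala.foreach { col =>
    println(col.getPath.mkString(" / "))
  }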

I’m building a projection like this:

    val projectionBuilder = Types.buildMessage()
    for (col <- projectedColumnDefs) {
      projectionBuilder.addField(col.getPrimitiveType)
    }
    val projectionType = projectionBuilder.named("projection")

The problem is that this projection then ends up containing a descriptor named “element” instead of “foo”, and I end up getting null values for this column (while still getting valid values for the primitive columns).
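
To make the mismatch concrete (my own sketch, not output from the original code): col.getPrimitiveType returns only the leaf type, so the array leaf is added as a bare top-level field and the enclosing groups are lost:

    // Hypothetical printout of the projection built above; the array column
    // shows up as a top-level primitive named "element" instead of being
    // nested inside the "foo" / "list" groups the file actually contains.
    println(projectionType)
    // message projection {
    //   optional binary element;
    //   ...
    // }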

This is how I’m applying the projection to the ParquetFileReader “r”:

    r.setRequestedSchema(projectionType)
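
For context (an assumed usage pattern, not spelled out in the original message), the requested schema then determines which columns come back when reading row groups:

    // Sketch of the read loop, assuming the projection above. readNextRowGroup()
    // returns one PageReadStore per row group, or null once the file is exhausted.
    var pages = r.readNextRowGroup()
    while (pages != null) {
      // build a RecordReader / column readers over `pages` here
      pages = r.readNextRowGroup()
    }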

I’d appreciate some pointers on general approach here.

Thanks,

Andy.






Re: Reading arrays from Java using ParquetFileReader

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Andy,

You should use the file schema to produce the projection. Otherwise, you'd
need to match the structure of the file in your projection schema. That's
possible, but it's easier to use the file's schema and create a new schema
with a subset of columns from the file. Here's an example:

https://github.com/Netflix/iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetSchemaUtil.java#L58-L76
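
In Scala, a minimal sketch of that approach (my own illustration, not the linked code) is to copy whole top-level fields from the file schema by name, so nested structures such as lists keep their full shape:

    import org.apache.parquet.schema.{MessageType, Types}
    import scala.collection.JavaConverters._

    // Build the projection from the file schema itself: each selected
    // top-level field is copied verbatim, preserving nested list/group structure.
    def project(fileSchema: MessageType, columnNames: Set[String]): MessageType = {
      val builder = Types.buildMessage()
      fileSchema.getFields.asScala
        .filter(field => columnNames.contains(field.getName))
        .foreach(field => builder.addField(field))
      builder.named(fileSchema.getName)
    }

    // e.g., request just the "foo" array column from the reader above
    r.setRequestedSchema(project(schema, Set("foo")))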

rb


-- 
Ryan Blue
Software Engineer
Netflix