Posted to dev@parquet.apache.org by Andy Grove <An...@rms.com> on 2018/04/13 17:31:16 UTC

Specifying a projection in Java API

Hi,

I’m trying to read a parquet file with a projection from Scala and I can’t find docs or examples for the correct way to do this.

I have the file schema and have filtered for the list of columns I need, so I have a List of ColumnDescriptors.

It looks like I should call ParquetFileReader.setRequestedSchema() but I can’t find an example of constructing the required MessageType parameter.

I’d appreciate any pointers on what to do next.

Thanks,

Andy.



Re: Specifying a projection in Java API

Posted by Andy Grove <An...@rms.com>.
OK, sorry for all the messages, but I have this working now.
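
The working code is not shown in the archive. A plausible end-to-end sketch, based on the fixes discussed in the rest of the thread (the column name is illustrative, and `schema` and `r` are the MessageType and ParquetFileReader from the earlier messages, so this is an assumption rather than the author's actual code):

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.io.ColumnIOFactory
import org.apache.parquet.schema.MessageType

// Build the projection by copying the wanted fields from the file schema,
// so each field keeps its original type and repetition.
val wanted = Set("my_projected_column") // illustrative column names
val projectionType = new MessageType(
  schema.getName,
  schema.getFields.asScala.filter(f => wanted.contains(f.getName)).asJava)
r.setRequestedSchema(projectionType)

// Read row groups, driving record assembly with the projection schema.
var pages = r.readNextRowGroup()
while (pages != null) {
  val columnIO = new ColumnIOFactory().getColumnIO(projectionType)
  val recordReader =
    columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))
  for (_ <- 0L until pages.getRowCount) {
    val group = recordReader.read()
    // Field indexes are positions within the projection, not the file schema.
    val value = group.getInteger(0, 0)
  }
  pages = r.readNextRowGroup()
}
r.close()
```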




Re: Specifying a projection in Java API

Posted by Andy Grove <An...@rms.com>.
Immediately after sending this I realized that I also needed to pass the projection message type in the following lines:

      val columnIO = new ColumnIOFactory().getColumnIO(projectionType)

      val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))

I feel like I am getting close. Current failure is:

Exception in thread "main" java.lang.RuntimeException: not found 2(my_projected_column) element number 0 in group:

	at org.apache.parquet.example.data.simple.SimpleGroup.getValue(SimpleGroup.java:97)
	at org.apache.parquet.example.data.simple.SimpleGroup.getInteger(SimpleGroup.java:129)
	at org.apache.parquet.example.data.GroupValueSource.getInteger(GroupValueSource.java:39)
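
A likely cause (an inference, not confirmed in the thread): once a projection is in place, SimpleGroup field indexes are positions within the projected schema, not the original file schema. Both accessors below assume a `group` read with the single-column projection; the column name is illustrative:

```scala
// The projection's only field is index 0, regardless of the column's
// position in the original file schema.
val byIndex = group.getInteger(0, 0)

// Or avoid index bookkeeping entirely by looking the field up by name.
val byName = group.getInteger("my_projected_column", 0)
```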



Re: Specifying a projection in Java API

Posted by Andy Grove <An...@rms.com>.
Thanks. I tried this.

    val projection: Seq[column.ColumnDescriptor] = ... // filter the columns I want from the schema

    val projectionBuilder = Types.buildMessage()
    for (col <- projection) {
      projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
    }
    r.setRequestedSchema(projectionBuilder.named("tbd"))

This fails when reading the file with "[some_col_name] optional int64 some_col_name is not in the store" where "some_col_name" is not part of my projection.

Any idea what I need to do next?

Thanks,

Andy.
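
The failure above is consistent with the builder loop creating empty message types as fields. A hedged fix (an assumption, not the list's confirmed answer) is to reuse the wanted fields, with their original types, straight from the file schema; `projection`, `schema`, and `r` are the values from the code in this thread:

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.schema.MessageType

// Copy the projected fields from the file schema so each keeps its original
// type (e.g. `optional int64`) instead of wrapping the name in a new
// message type.
val wantedNames = projection.map(_.getPath.head).toSet
val projectedFields =
  schema.getFields.asScala.filter(f => wantedNames.contains(f.getName))
r.setRequestedSchema(new MessageType(schema.getName, projectedFields.asJava))
```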



Re: Specifying a projection in Java API

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'd suggest using the Types builders to create your projection schema
(MessageType), then passing that schema to the
ParquetFileReader.setRequestedSchema method you found.
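
A minimal sketch of that suggestion (the column names and types here are placeholders and must match the file schema exactly; `r` is the ParquetFileReader from the code quoted below):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.Types

// Build a MessageType containing only the columns to read. Each field's
// type and repetition must match the corresponding field in the file schema.
val projection = Types.buildMessage()
  .optional(PrimitiveTypeName.INT64).named("some_col_name")
  .optional(PrimitiveTypeName.BINARY).named("another_col")
  .named("projection")

r.setRequestedSchema(projection)
```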

On Fri, Apr 13, 2018 at 10:40 AM, Andy Grove <An...@rms.com> wrote:

> Hi Ryan,
>
> I'm writing some low-level performance tests to try and find a bottleneck
> on our platform and have intentionally excluded Spark/Thrift/Presto etc and
> want to test Parquet directly both with local files and against our HDFS
> cluster to get performance metrics. Our parquet files were created by Spark
> and contain schema meta-data.
>
> Here is my code for opening the file:
>
>     val footer = ParquetFileReader.open(file, options)
>     val schema = footer.getFileMetaData.getSchema
>     val r = new ParquetFileReader(file, options)
>
> I can call schema.getColumns and see all of the column definitions.
>
> I have my query working fine but it is reading all the columns and I want
> to push down the projection so it only reads the 5 columns I need.
>
> I see that there are some versions of the ParquetFileReader constructors
> that accept a List[ColumnDescriptor] and I did try that but ran into errors.
>
> What would you suggest?
>
> Thanks,
>
> Andy.
>
>
> On 4/13/18, 11:34 AM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>
>     Andy, what object model are you using to read? Usually you don't have a
>     list of column descriptors, you have an Avro read schema or a Thrift
> class
>     or something.
>
>     On Fri, Apr 13, 2018 at 10:31 AM, Andy Grove <An...@rms.com>
> wrote:
>
>     > Hi,
>     >
>     > I’m trying to read a parquet file with a projection from Scala and I
> can’t
>     > find docs or examples for the correct way to do this.
>     >
>     > I have the file schema and have filtered for the list of columns I
> need,
>     > so I have a List of ColumnDescriptors.
>     >
>     > It looks like I should call ParquetFileReader.setRequestedSchema()
> but I
>     > can’t find an example of constructing the required MessageType
> parameter.
>     >
>     > I’d appreciate any pointers on what to do next.
>     >
>     > Thanks,
>     >
>     > Andy.
>     >
>     >
>     >
>
>
>     --
>     Ryan Blue
>     Software Engineer
>     Netflix
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Specifying a projection in Java API

Posted by Andy Grove <An...@rms.com>.
Hi Ryan,

I'm writing some low-level performance tests to try and find a bottleneck on our platform and have intentionally excluded Spark/Thrift/Presto etc and want to test Parquet directly both with local files and against our HDFS cluster to get performance metrics. Our parquet files were created by Spark and contain schema meta-data.

Here is my code for opening the file:

    val footer = ParquetFileReader.open(file, options)
    val schema = footer.getFileMetaData.getSchema
    val r = new ParquetFileReader(file, options)

I can call schema.getColumns and see all of the column definitions.

I have my query working fine but it is reading all the columns and I want to push down the projection so it only reads the 5 columns I need.

I see that there are some versions of the ParquetFileReader constructors that accept a List[ColumnDescriptor] and I did try that but ran into errors.

What would you suggest?

Thanks,

Andy.




Re: Specifying a projection in Java API

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Andy, what object model are you using to read? Usually you don't have a
list of column descriptors, you have an Avro read schema or a Thrift class
or something.



-- 
Ryan Blue
Software Engineer
Netflix