Posted to dev@parquet.apache.org by Grant Monroe <gr...@tnarg.com> on 2017/03/16 04:41:26 UTC

Re: Failing C++ Parquet Writer

Yes, I realized after posting that my example was faulty because I'm not
creating a new row group every 3 rows. But consider an even simpler example:

https://gist.github.com/tnarg/caa2f098091760255e3c60da2cf17438

 I want to write a single json object:

{
  "foo": false,
  "bars": [1,2,3]
}

I would create two columns in my schema, choose a row group size of 10,
and write 1 row to the "foo" column and 3 rows to the "bars" column. I get
an error because I didn't write exactly 10 rows to each column. This seems
broken.

gmonroe@blah:~$ ./writer
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Less than the number of expected rows written in the current
column chunk
Aborted (core dumped)


On 2017-03-13 18:01 (-0400), Wes McKinney <we...@gmail.com> wrote:
> hi Grant,
>
> the exception is coming from
>
>   if (num_rows_ != expected_rows_) {
>     throw ParquetException(
>         "Less than the number of expected rows written in"
>         " the current column chunk");
>   }
>
>
https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>
> This is double buggy -- the size of the row group and the number of
> values written is different, but you're writing *more* values than the
> row group contains. I'm opening a JIRA to throw a better exception
>
> See the logic for forming num_rows_ for columns with max_repetition_level
> 0:
>
>
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
>
> num_rows_ is incremented each time a new record begins
> (repetition_level 0). You can write as many repeated values as you
> like in a row group as long as the repetition levels encode the
> corresponding number of records -- if you run into a case where this
> happens, can you open a JIRA so we can add a test case and fix?
>
> Thanks
> Wes
>
> On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <gr...@tnarg.com> wrote:
> > I should also mention that I built parquet-cpp from github, commit
> > 1c4492a111b00ef48663982171e3face1ca2192d.
> >
> > On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <gr...@tnarg.com> wrote:
> >
> >> I'm struggling to get a simple parquet writer working using the c++
> >> library. The source is here:
> >>
> >> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> >>
> >> and I'm compiling like so
> >>
> >> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> >>
> >> When I run this program, I get the following error
> >>
> >> gmonroe@foo:~$ ./writer
> >> terminate called after throwing an instance of 'parquet::ParquetException'
> >>   what():  Less than the number of expected rows written in the current
> >> column chunk
> >> Aborted (core dumped)
> >>
> >> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
> >> that every column needs to contain N values such that
> >> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
> >> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
> >>
> >> Is this a bug in the c++ library or am I missing something in the API?
> >>
> >> Regards,
> >> Grant Monroe
> >>
>

Re: Failing C++ Parquet Writer

Posted by Deepak Majeti <ma...@gmail.com>.
Hi Grant,

Can you use the master branch or the 1.0.0-rc5 release and try again? You
will just get the error and not the core dump.

Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound on
the total number of rows in a RowGroup. The number of rows added must be
exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
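
For example, with the low-level column writer API that contract looks
roughly like the sketch below (names follow examples/reader-writer.cc;
the file_writer setup is omitted and exact signatures can differ between
versions):

  constexpr int NUM_ROWS_PER_ROW_GROUP = 10;

  // The row group is declared with an exact logical row count...
  parquet::RowGroupWriter* rg_writer =
      file_writer->AppendRowGroup(NUM_ROWS_PER_ROW_GROUP);

  // ...and every column must then receive exactly that many logical rows.
  auto* bool_writer =
      static_cast<parquet::BoolWriter*>(rg_writer->NextColumn());
  for (int i = 0; i < NUM_ROWS_PER_ROW_GROUP; ++i) {
    bool value = (i % 2 == 0);
    bool_writer->WriteBatch(1, nullptr, nullptr, &value);  // REQUIRED: no levels
  }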

On Thu, Mar 16, 2017 at 12:41 AM, Grant Monroe <gr...@tnarg.com> wrote:

> Yes, I realized after posting that my example was faulty because I'm not
> creating a new row group every 3 rows. But consider an even simpler
> example:
>
> https://gist.github.com/tnarg/caa2f098091760255e3c60da2cf17438
>
>  I want to write a single json object:
>
> {
>   "foo": false,
>   "bars": [1,2,3]
> }
>
> I would create two columns in my schema, I choose a row group size of 10,
> and write 1 row to the "foo" column and 3 rows to the "bars" column. I get
> an error because I didn't write exactly 10 rows to each column. This seems
> broken.
>
> gmonroe@blah:~$ ./writer
> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Less than the number of expected rows written in the current
> column chunk
> Aborted (core dumped)
>
>
> On 2017-03-13 18:01 (-0400), Wes McKinney <we...@gmail.com> wrote:
> > hi Grant,
> >
> > the exception is coming from
> >
> >   if (num_rows_ != expected_rows_) {
> >     throw ParquetException(
> >         "Less than the number of expected rows written in"
> >         " the current column chunk");
> >   }
> >
> >
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
> >
> > This is double buggy -- the size of the row group and the number of
> > values written is different, but you're writing *more* values than the
> > row group contains. I'm opening a JIRA to throw a better exception
> >
> > See the logic for forming num_rows_ for columns with max_repetition_level
> > 0:
> >
> >
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
> >
> > num_rows_ is incremented each time a new record begins
> > (repetition_level 0). You can write as many repeated values as you
> > like in a row group as long as the repetition levels encode the
> > corresponding number of records -- if you run into a case where this
> > happens, can you open a JIRA so we can add a test case and fix?
> >
> > Thanks
> > Wes
> >
> > On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <gr...@tnarg.com> wrote:
> > > I should also mention that I built parquet-cpp from github, commit
> > > 1c4492a111b00ef48663982171e3face1ca2192d.
> > >
> > > On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <gr...@tnarg.com>
> wrote:
> > >
> > >> I'm struggling to get a simple parquet writer working using the c++
> > >> library. The source is here:
> > >>
> > >> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> > >>
> > >> and I'm compiling like so
> > >>
> > >> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> > >>
> > >> When I run this program, I get the following error
> > >>
> > >> gmonroe@foo:~$ ./writer
> > >> terminate called after throwing an instance of 'parquet::ParquetException'
> > >>   what():  Less than the number of expected rows written in the current
> > >> column chunk
> > >> Aborted (core dumped)
> > >>
> > >> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
> > >> that every column needs to contain N values such that
> > >> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
> > >> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
> > >>
> > >> Is this a bug in the c++ library or am I missing something in the API?
> > >>
> > >> Regards,
> > >> Grant Monroe
> > >>
> >
>



-- 
regards,
Deepak Majeti

Re: Failing C++ Parquet Writer

Posted by Grant Monroe <gr...@tnarg.com>.
Okay, cool. I was just missing that "NUM_ROWS_PER_ROW_GROUP" counts logical rows, not the number of entries per column. So in the case of my example NUM_ROWS_PER_ROW_GROUP=1 is correct. Thanks!
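
For the archives, here is a rough end-to-end sketch of writing that single
record, loosely modeled on examples/reader-writer.cc. The exact Open()
signatures may differ between parquet-cpp versions, so treat it as a
sketch rather than a verbatim fix to the gist above:

  #include <memory>

  #include <arrow/io/file.h>
  #include <parquet/api/writer.h>

  using parquet::Repetition;
  using parquet::Type;
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  int main() {
    // Open the output file (Status-returning form used by the examples at
    // the time; newer releases return arrow::Result instead).
    std::shared_ptr<arrow::io::FileOutputStream> out_file;
    PARQUET_THROW_NOT_OK(
        arrow::io::FileOutputStream::Open("single_record.parquet", &out_file));

    // Schema for { "foo": false, "bars": [1,2,3] }
    parquet::schema::NodeVector fields;
    fields.push_back(PrimitiveNode::Make("foo", Repetition::REQUIRED, Type::BOOLEAN));
    fields.push_back(PrimitiveNode::Make("bars", Repetition::REPEATED, Type::INT32));
    auto schema = std::static_pointer_cast<GroupNode>(
        GroupNode::Make("schema", Repetition::REQUIRED, fields));

    parquet::WriterProperties::Builder props;
    std::shared_ptr<parquet::ParquetFileWriter> file_writer =
        parquet::ParquetFileWriter::Open(out_file, schema, props.build());

    // The whole JSON object is ONE logical row.
    parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(1);

    // "foo": one value; a REQUIRED scalar column needs no def/rep levels.
    auto* foo_writer = static_cast<parquet::BoolWriter*>(rg_writer->NextColumn());
    bool foo = false;
    foo_writer->WriteBatch(1, nullptr, nullptr, &foo);

    // "bars": three physical values that still form one logical row.
    // rep level 0 starts a new record, rep level 1 continues the same list;
    // def level 1 marks each value as present for this REPEATED field.
    auto* bars_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
    int32_t bars[] = {1, 2, 3};
    int16_t def_levels[] = {1, 1, 1};
    int16_t rep_levels[] = {0, 1, 1};
    bars_writer->WriteBatch(3, def_levels, rep_levels, bars);

    file_writer->Close();
    return 0;
  }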

On 2017-03-16 15:51 (-0400), Wes McKinney <we...@gmail.com> wrote: 
> hi Grant,
> 
> The value [1, 2, 3] is only 1 value, not 3. The "Number of rows"
> passed to the row group is with respect to top level records, *not*
> counting repeated fields.
> 
> From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I
> believe the correct data to write is:
> 
> rep level | def level  | value
> 0         | 1          | 1
> 1         | 1          | 2
> 1         | 1          | 3
> 
> parquet-cpp knows from this data that the 3 values are part of only
> one logical record
> 
> Does that make sense?
> 
> Thanks
> Wes
> 
> On Thu, Mar 16, 2017 at 3:40 PM, Grant Monroe <gr...@tnarg.com> wrote:
> > Hi Deepak,
> >
> >> Can you use the master branch or the 1.0.0-rc5 release and try again? You
> >> will just get the error and not the core dump.
> >
> > Upgrading to master does indeed remove the abort().
> >
> >> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound to
> >> the total number of rows in a RowGroup. The number of rows being added must
> >> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
> >
> > I can see that from the error message. My question is, given the example JSON object
> >
> > {
> > "foo": false,
> > "bars": [1,2,3]
> > }
> >
> > how might I store this using the parquet-cpp API? I have one column with 1 value and another with 3. The only general solution I can see would be to use  NUM_ROWS_PER_ROW_GROUP=1 which seems like nonsense. What am I missing? Sample code would be helpful.
> >
> > Thanks,
> > Grant
> >
> 

Re: Failing C++ Parquet Writer

Posted by Deepak Majeti <ma...@gmail.com>.
As an example, you can look at
https://github.com/apache/parquet-cpp/blob/master/examples/reader-writer.cc#L140
The int64_field column has a list of size 2 in every row.
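
In the same spirit, here is a small fragment for a REPEATED INT64 column
where every logical row carries a two-element list. This is a sketch only,
not a copy of that file; it assumes file_writer is set up as in the
example and that this column is the next one returned by NextColumn():

  const int kNumRows = 10;  // must match the count passed to AppendRowGroup()
  parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(kNumRows);

  auto* int64_writer =
      static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
  for (int64_t row = 0; row < kNumRows; ++row) {
    // Two physical values per logical row: rep level 0 opens the row's list,
    // rep level 1 appends to it; def level 1 marks each value as present.
    int64_t values[2] = {row * 2, row * 2 + 1};
    int16_t def_levels[2] = {1, 1};
    int16_t rep_levels[2] = {0, 1};
    int64_writer->WriteBatch(2, def_levels, rep_levels, values);
  }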

On Thu, Mar 16, 2017 at 3:56 PM, Wes McKinney <we...@gmail.com> wrote:

> The definition levels depend on the array encoding -- so to account
> for nullable lists and nullable values, the actual definition levels
> (based on the schema) may range from 1 to 3.
>
> I found this exposition in the Impala codebase really useful:
>
> https://github.com/apache/incubator-impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L78
>
>
> On Thu, Mar 16, 2017 at 3:51 PM, Wes McKinney <we...@gmail.com> wrote:
> > hi Grant,
> >
> > The value [1, 2, 3] is only 1 value, not 3. The "Number of rows"
> > passed to the row group is with respect to top level records, *not*
> > counting repeated fields.
> >
> > From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I
> > believe the correct data to write is:
> >
> > rep level | def level  | value
> > 0         | 1          | 1
> > 1         | 1          | 2
> > 1         | 1          | 3
> >
> > parquet-cpp knows from this data that the 3 values are part of only
> > one logical record
> >
> > Does that make sense?
> >
> > Thanks
> > Wes
> >
> > On Thu, Mar 16, 2017 at 3:40 PM, Grant Monroe <gr...@tnarg.com> wrote:
> >> Hi Deepak,
> >>
> >>> Can you use the master branch or the 1.0.0-rc5 release and try again?
> You
> >>> will just get the error and not the core dump.
> >>
> >> Upgrading to master does indeed remove the abort().
> >>
> >>> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper
> bound to
> >>> the total number of rows in a RowGroup. The number of rows being added
> must
> >>> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
> >>
> >> I can see that from the error message. My question is, given the
> example JSON object
> >>
> >> {
> >> "foo": false,
> >> "bars": [1,2,3]
> >> }
> >>
> >> how might I store this using the parquet-cpp API? I have one column
> with 1 value and another with 3. The only general solution I can see would
> be to use  NUM_ROWS_PER_ROW_GROUP=1 which seems like nonsense. What am I
> missing? Sample code would be helpful.
> >>
> >> Thanks,
> >> Grant
> >>
>



-- 
regards,
Deepak Majeti

Re: Failing C++ Parquet Writer

Posted by Wes McKinney <we...@gmail.com>.
The definition levels depend on the array encoding -- so to account
for nullable lists and nullable values, the actual definition levels
(based on the schema) may range from 1 to 3.

I found this exposition in the Impala codebase really useful:

https://github.com/apache/incubator-impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L78
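
As a concrete illustration (a sketch of the general three-level LIST
encoding, not code from the gist in this thread): for an optional list of
optional int32 elements the max definition level is 3, and the levels line
up as below, assuming int32_writer is the parquet::Int32Writer* for that
leaf column:

  // def 0 -> the list itself is null          (no value stored)
  // def 1 -> the list is present but empty    (no value stored)
  // def 2 -> a list slot is present but null  (no value stored)
  // def 3 -> a non-null element               (value stored)
  //
  // Three logical rows: null, [], [7, null, 9]
  int32_t values[]     = {7, 9};               // only def level 3 carries data
  int16_t def_levels[] = {0, 1, 3, 2, 3};
  int16_t rep_levels[] = {0, 0, 0, 1, 1};
  int32_writer->WriteBatch(5, def_levels, rep_levels, values);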


On Thu, Mar 16, 2017 at 3:51 PM, Wes McKinney <we...@gmail.com> wrote:
> hi Grant,
>
> The value [1, 2, 3] is only 1 value, not 3. The "Number of rows"
> passed to the row group is with respect to top level records, *not*
> counting repeated fields.
>
> From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I
> believe the correct data to write is:
>
> rep level | def level  | value
> 0         | 1          | 1
> 1         | 1          | 2
> 1         | 1          | 3
>
> parquet-cpp knows from this data that the 3 values are part of only
> one logical record
>
> Does that make sense?
>
> Thanks
> Wes
>
> On Thu, Mar 16, 2017 at 3:40 PM, Grant Monroe <gr...@tnarg.com> wrote:
>> Hi Deepak,
>>
>>> Can you use the master branch or the 1.0.0-rc5 release and try again? You
>>> will just get the error and not the core dump.
>>
>> Upgrading to master does indeed remove the abort().
>>
>>> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound to
>>> the total number of rows in a RowGroup. The number of rows being added must
>>> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
>>
>> I can see that from the error message. My question is, given the example JSON object
>>
>> {
>> "foo": false,
>> "bars": [1,2,3]
>> }
>>
>> how might I store this using the parquet-cpp API? I have one column with 1 value and another with 3. The only general solution I can see would be to use  NUM_ROWS_PER_ROW_GROUP=1 which seems like nonsense. What am I missing? Sample code would be helpful.
>>
>> Thanks,
>> Grant
>>

Re: Failing C++ Parquet Writer

Posted by Wes McKinney <we...@gmail.com>.
hi Grant,

The value [1, 2, 3] is only 1 value, not 3. The "Number of rows"
passed to the row group is with respect to top level records, *not*
counting repeated fields.

From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I
believe the correct data to write is:

rep level | def level  | value
0         | 1          | 1
1         | 1          | 2
1         | 1          | 3

parquet-cpp knows from this data that the 3 values are part of only
one logical record
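
In parquet-cpp terms that table maps onto a single WriteBatch call, roughly
like this (a sketch; bars_writer stands for the typed column writer obtained
from RowGroupWriter::NextColumn() for the repeated column):

  // One logical record, three physical values, matching the table above.
  int32_t values[]     = {1, 2, 3};
  int16_t def_levels[] = {1, 1, 1};  // each value is present
  int16_t rep_levels[] = {0, 1, 1};  // 0 starts the record, 1 continues the list
  bars_writer->WriteBatch(3, def_levels, rep_levels, values);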

Does that make sense?

Thanks
Wes

On Thu, Mar 16, 2017 at 3:40 PM, Grant Monroe <gr...@tnarg.com> wrote:
> Hi Deepak,
>
>> Can you use the master branch or the 1.0.0-rc5 release and try again? You
>> will just get the error and not the core dump.
>
> Upgrading to master does indeed remove the abort().
>
>> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound to
>> the total number of rows in a RowGroup. The number of rows being added must
>> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
>
> I can see that from the error message. My question is, given the example JSON object
>
> {
> "foo": false,
> "bars": [1,2,3]
> }
>
> how might I store this using the parquet-cpp API? I have one column with 1 value and another with 3. The only general solution I can see would be to use  NUM_ROWS_PER_ROW_GROUP=1 which seems like nonsense. What am I missing? Sample code would be helpful.
>
> Thanks,
> Grant
>

Re: Failing C++ Parquet Writer

Posted by Grant Monroe <gr...@tnarg.com>.
Hi Deepak,

> Can you use the master branch or the 1.0.0-rc5 release and try again? You
> will just get the error and not the core dump.

Upgrading to master does indeed remove the abort(). 

> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound to
> the total number of rows in a RowGroup. The number of rows being added must
> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.

I can see that from the error message. My question is, given the example JSON object

{
"foo": false,
"bars": [1,2,3]
}

how might I store this using the parquet-cpp API? I have one column with 1 value and another with 3. The only general solution I can see would be to use NUM_ROWS_PER_ROW_GROUP=1, which seems like nonsense. What am I missing? Sample code would be helpful.

Thanks,
Grant