Posted to dev@parquet.apache.org by Jean-Pascal Billaud <jp...@tellapart.com> on 2014/11/11 22:07:32 UTC

Hive Parquet Reader and "repeated" field

Hi,

I am trying to integrate Parquet as the underlying storage format in our
data pipeline, but I am facing some issues that I hope some of you can help
me with.

The batch layer is fairly standard: Cascading jobs write Thrift log objects
from an input tap to a Parquet output sink. Here is a snippet of one of the
serialized Thrift structures:

struct RequestInfo {
  1: optional string status,
  2: optional list<RequestDetails> requests,
}

struct RequestDetails {
  1: optional string type,
  2: optional bool valid,
}

According to the Cascading Parquet writer, this translates into the
following Parquet schema:

optional binary status (UTF8);
optional group requests (LIST) {
  repeated group requests_tuple {
    optional binary type (UTF8);
    optional boolean valid;
  }
}

Then I have a Hive table that points to the Parquet file and specifies the
serialized Thrift class:

CREATE EXTERNAL TABLE parquet_requests
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs_somewhere'
TBLPROPERTIES ( 'thrift.class' = 'RequestInfo' );

While running "select * from parquet_requests", the whole thing crashes
with the exception thrown in the following constructor:

  public ArrayWritableGroupConverter(final GroupType groupType,
      final HiveGroupConverter parent, final int index) {
    this.parent = parent;
    this.index = index;
    int count = groupType.getFieldCount();
    if (count < 1 || count > 2) {
      throw new IllegalStateException("Field count must be either 1 or 2: " + count);
    }

What this means is that requests_tuple is not considered a valid list
because it has more than one field. The reader basically expects the
"repeated" keyword on "requests (LIST)" rather than on "requests_tuple".
The current code also does not seem to handle "repeated" on primitives,
since the ETypeConverters always call parent.set() and therefore always
replace the previously stored instance.
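
To make the two problems concrete, here is a minimal, self-contained Java
sketch. It does not use any Hive or Parquet classes: the class name
RepeatedFieldSketch and all of its methods are hypothetical stand-ins, and
the field count of 3 is an illustrative value rather than one taken from
the schema above. It only mirrors the bounds of the quoted check and
contrasts set()-style storage with append-style storage for repeated values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RepeatedFieldSketch {

    // Same bounds as the quoted check: only 1 or 2 fields are accepted.
    static boolean passesFieldCountCheck(int fieldCount) {
        return fieldCount >= 1 && fieldCount <= 2;
    }

    // set()-style storage: a single slot, so each value overwrites the previous one.
    static List<String> collectWithSet(List<String> repeatedValues) {
        String slot = null;
        for (String value : repeatedValues) {
            slot = value;
        }
        List<String> result = new ArrayList<>();
        if (slot != null) {
            result.add(slot);
        }
        return result;
    }

    // Append-style storage: every repeated value is kept, in order.
    static List<String> collectWithAppend(List<String> repeatedValues) {
        return new ArrayList<>(repeatedValues);
    }

    public static void main(String[] args) {
        // A hypothetical repeated group with three fields is rejected.
        System.out.println(passesFieldCountCheck(3));   // prints "false"

        List<String> values = Arrays.asList("a", "b", "c");
        System.out.println(collectWithSet(values));     // prints "[c]"
        System.out.println(collectWithAppend(values));  // prints "[a, b, c]"
    }
}
```

The second half is the point about primitives: with a single slot, only the
last repeated value survives, whereas a reader for repeated fields would
need to accumulate them.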

I cooked up a patch which, as far as I can tell, would fix the issues here,
and I would like some comments on whether it is heading in the right
direction before I submit a more formal pull request. Things still need to
be polished, so please don't spend too much time on the form; focus on the
approach instead.

https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d

Moreover, I have a feeling that I should probably not pass the Thrift class
for the Parquet table, given that at this point it is totally irrelevant and
the Parquet schema is stored in the Parquet files. I also expect some
ObjectInspector issues due to the extra grouping introduced by the
requests_tuple entry. Thoughts?

Thanks,

Re: Hive Parquet Reader and "repeated" field

Posted by Jean-Pascal Billaud <jp...@tellapart.com>.
Hey Ryan,

I take it, therefore, that parquet-thrift structures using lists/sets/maps
are not supported with Hive as of today.

Regarding the patch that I posted, since I need to make it work for my
deployment regardless, does the approach make sense so far? I still need to
hack the ObjInspector so that once Hive encounters a LIST field (from
Hive's standpoint), the ObjInspector removes the one unnecessary layer of
ArrayWritable coming from the extra "_tuple" field. Does that make sense?

Thanks,

On Tue, Nov 11, 2014 at 3:14 PM, Ryan Blue <bl...@cloudera.com> wrote:

> On 11/11/2014 01:07 PM, Jean-Pascal Billaud wrote:
>
>> While running "select * from parquet_requests", the whole thing crashes
>> with the
>> following exception:
>>
>>    > public ArrayWritableGroupConverter(final GroupType groupType, final
>> HiveGroupConverter parent,
>>    >    final int index) {
>>    >   this.parent = parent;
>>    >   this.index = index;
>>    >   int count = groupType.getFieldCount();
>>    >   if (count < 1 || count > 2) {
>>    >     throw new IllegalStateException("Field count must be either 1 or
>> 2:
>> " + count);
>>    >   }
>>    >
>>
>> What this means is that requests_tuple is not considered a valid list
>> because
>> it has more than one field. It basically expects the "repeated" keyword on
>> the
>> "requests (LIST)" as opposed to "requests_tuple". The actual code also
>> does
>> not
>> seem to handle repeated on primitives since the ETypeConverters always
>> call
>> parent.set() hence always replacing the previous stored instance.
>>
>> I cooked up a patch which as far as I can tell would fix the issues here
>> and
>> I would like to have some comments to see if that patch is in the right
>> direction
>> before submitting a more formal pull request. Things need to be polished
>> so
>> please don't spend too much time on the form but more on the approach.
>>
>> https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbd
>> f8cd9b920d
>>
>> Moreover, I have a feeling that I should probably not pass the thrift
>> class
>> for
>> the parquet table given that at this point it is totally irrelevant and
>> the
>> parquet
>> schema is stored in the parquet files. I also expect some ObjectInspector
>> issue
>> due to the extra grouping provided by the requests_tuple entry. Thoughts?
>>
>> Thanks,
>>
>>
> Hi Jean-Pascal,
>
> This is a known issue that we're going to be fixing shortly. The problem
> is that there's a difference in the way Hive and Thrift (or Avro)
> represent lists. PARQUET-113 [1] is an effort to define what is currently
> being written and what we need to do to add compatibility. It also
> specifies what should be written.
>
> Hive is one of the first object models that will be updated with the
> backward-compatibility rules so that it can read parquet-avro and
> parquet-thrift structures correctly.
>
> rb
>
> [1]: https://issues.apache.org/jira/browse/PARQUET-113
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Hive Parquet Reader and "repeated" field

Posted by Ryan Blue <bl...@cloudera.com>.
On 11/11/2014 01:07 PM, Jean-Pascal Billaud wrote:
> While running "select * from parquet_requests", the whole thing crashes
> with the
> following exception:
>
>    > public ArrayWritableGroupConverter(final GroupType groupType, final
> HiveGroupConverter parent,
>    >    final int index) {
>    >   this.parent = parent;
>    >   this.index = index;
>    >   int count = groupType.getFieldCount();
>    >   if (count < 1 || count > 2) {
>    >     throw new IllegalStateException("Field count must be either 1 or 2:
> " + count);
>    >   }
>    >
>
> What this means is that requests_tuple is not considered a valid list
> because
> it has more than one field. It basically expects the "repeated" keyword on
> the
> "requests (LIST)" as opposed to "requests_tuple". The actual code also does
> not
> seem to handle repeated on primitives since the ETypeConverters always call
> parent.set() hence always replacing the previous stored instance.
>
> I cooked up a patch which as far as I can tell would fix the issues here and
> I would like to have some comments to see if that patch is in the right
> direction
> before submitting a more formal pull request. Things need to be polished so
> please don't spend too much time on the form but more on the approach.
>
> https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d
>
> Moreover, I have a feeling that I should probably not pass the thrift class
> for
> the parquet table given that at this point it is totally irrelevant and the
> parquet
> schema is stored in the parquet files. I also expect some ObjectInspector
> issue
> due to the extra grouping provided by the requests_tuple entry. Thoughts?
>
> Thanks,
>

Hi Jean-Pascal,

This is a known issue that we're going to be fixing shortly. The problem
is that there's a difference in the way Hive and Thrift (or Avro)
represent lists. PARQUET-113 [1] is an effort to define what is
currently being written and what we need to do to add compatibility.
It also specifies what should be written.

Hive is one of the first object models that will be updated with the 
backward-compatibility rules so that it can read parquet-avro and 
parquet-thrift structures correctly.

rb

[1]: https://issues.apache.org/jira/browse/PARQUET-113

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.
