Posted to user@orc.apache.org by "Piyush Mukati (Data Platform)" <pi...@flipkart.com> on 2017/02/20 07:14:07 UTC

Converting json record to ORC.

Hi,
We have a use case where our MR job has to read both old JSON data (each
line is a JSON record with a fixed schema) and ORC files. The output of the
job will be ORC files.

I tried several approaches:

1) HCatalog, but it does not support reading from multiple tables as of
now. The JSON data doesn't have Hive tables either.

2) The Hive ORC library and SerDe, but I was unable to pass the OrcStruct
through the shuffle phase because it doesn't implement Writable. (I am
creating the OrcStruct in the mapper.)

3) Currently I am looking at the org.apache.orc.mapreduce APIs. Everything
looks good here, but I have to convert each existing JSON record to an
OrcStruct. This looks like a common use case, and writing a converter myself
feels like reinventing the wheel.

Is anyone in the community aware of any utilities that can help me convert
JSON to OrcStruct? Any other suggestions are welcome.

Thanks

Re: Converting json record to ORC.

Posted by Owen O'Malley <om...@apache.org>.
The advantage of using the VectorizedRowBatch is that it is faster than
using the OrcStruct, etc. classes. If you look at the ORC mapreduce
implementation, it layers the OrcStruct on top of the VectorizedRowBatch.

I've updated the pull request with a couple of changes:
* I've made JsonReader implement org.apache.orc.RecordReader instead of
BatchReader.
* I've made the OrcMapredRecordReader and OrcMapreduceRecordReader take a
RecordReader.

So it is easy to compose the two together to get a JsonReader that gives
you OrcStructs.

How does that look?

.. Owen
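
To illustrate the layering described above, here is a minimal, self-contained
sketch in plain Java. It does not use the ORC library: Batch, BatchSource,
Row, RowReader and LineBatchSource are hypothetical stand-ins (roughly for
VectorizedRowBatch, a batch-producing reader such as the JsonReader,
OrcStruct, and the mapred/mapreduce record readers), and the "id,name" line
format is an assumption used in place of real JSON parsing. It only shows the
general shape of exposing a batch reader one row at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowOverBatchSketch {

    /** Toy columnar batch: one array per column plus a count of filled rows. */
    static final class Batch {
        final long[] ids;
        final String[] names;
        int size; // number of rows currently filled

        Batch(int capacity) {
            ids = new long[capacity];
            names = new String[capacity];
        }
    }

    /** Anything that can fill a batch; returns false once the input is exhausted. */
    interface BatchSource {
        boolean nextBatch(Batch batch);
    }

    /** Row object handed to the caller (hypothetical stand-in for a struct type). */
    static final class Row {
        long id;
        String name;
    }

    /** Stand-in for a text reader: parses well-formed "id,name" lines into batches. */
    static final class LineBatchSource implements BatchSource {
        private final List<String> lines;
        private int pos = 0;

        LineBatchSource(List<String> lines) {
            this.lines = lines;
        }

        @Override
        public boolean nextBatch(Batch batch) {
            batch.size = 0;
            while (pos < lines.size() && batch.size < batch.ids.length) {
                String[] parts = lines.get(pos++).split(",", 2);
                batch.ids[batch.size] = Long.parseLong(parts[0].trim());
                batch.names[batch.size] = parts[1];
                batch.size++;
            }
            return batch.size > 0;
        }
    }

    /** Adapter that exposes a BatchSource one row at a time. */
    static final class RowReader {
        private final BatchSource source;
        private final Batch batch;
        private int rowInBatch = 0;

        RowReader(BatchSource source, int batchCapacity) {
            this.source = source;
            this.batch = new Batch(batchCapacity);
        }

        /** Copies the next row into 'row'; returns false when no rows remain. */
        boolean next(Row row) {
            while (rowInBatch >= batch.size) {
                rowInBatch = 0;
                if (!source.nextBatch(batch)) {
                    return false;
                }
            }
            row.id = batch.ids[rowInBatch];
            row.name = batch.names[rowInBatch];
            rowInBatch++;
            return true;
        }
    }

    /** Reads every row through the adapter and renders it as "id=name". */
    public static List<String> readAll(List<String> lines, int batchCapacity) {
        RowReader reader = new RowReader(new LineBatchSource(lines), batchCapacity);
        Row row = new Row(); // reused for every row, as row readers commonly do
        List<String> out = new ArrayList<>();
        while (reader.next(row)) {
            out.add(row.id + "=" + row.name);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three rows with a batch capacity of 2, so the adapter refills once.
        System.out.println(readAll(Arrays.asList("1,alice", "2,bob", "3,carol"), 2));
        // prints [1=alice, 2=bob, 3=carol]
    }
}
```

The caller only ever sees rows; the batch size is an internal detail of the
adapter. In this toy model, swapping LineBatchSource for another BatchSource
changes the input format without touching the row-level code.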



On Tue, Feb 21, 2017 at 2:35 AM, Piyush Mukati (Data Platform) <
piyush.mukati@flipkart.com> wrote:

> Thanks Owen,
>
> What if we use the OrcStruct/OrcList etc. classes instead
> of org.apache.hadoop.hive.ql.exec.vector* in the JsonReader?
>
> We could eventually turn it into a library utility that converts between
> the ORC data type classes and the classes of other well-known formats.
> org.apache.orc.mapred.OrcInputFormat/OrcOutputFormat also work with
> OrcStruct/OrcList, so the utility would be helpful for writing custom MR jobs.
>
> Let me know your thoughts on the above. I can help by contributing in that
> direction.

Re: Converting json record to ORC.

Posted by "Piyush Mukati (Data Platform)" <pi...@flipkart.com>.
Thanks Owen,

What if we use the OrcStruct/OrcList etc. classes instead
of org.apache.hadoop.hive.ql.exec.vector* in the JsonReader?

We could eventually turn it into a library utility that converts between
the ORC data type classes and the classes of other well-known formats.
org.apache.orc.mapred.OrcInputFormat/OrcOutputFormat also work with
OrcStruct/OrcList, so the utility would be helpful for writing custom MR jobs.

Let me know your thoughts on the above. I can help by contributing in that
direction.


On Mon, Feb 20, 2017 at 11:20 PM, Owen O'Malley <om...@apache.org> wrote:

> A few of us have written hacky ones, but we should have an official one
> that is more robust. Mine was in this pull request
> https://github.com/apache/orc/pull/43/commits/48a9f3443062bfaee4b684e49b137106bbfe9947#diff-efa8880e64e22de68f1e34c2f1d5b538
> where I was converting the github archives data to ORC for benchmarking.
>
> I've created a jira https://issues.apache.org/jira/browse/ORC-150 for
> adding one.
>
> .. Owen

Re: Converting json record to ORC.

Posted by Owen O'Malley <om...@apache.org>.
A few of us have written hacky ones, but we should have an official one
that is more robust. Mine was in this pull request
https://github.com/apache/orc/pull/43/commits/48a9f3443062bfaee4b684e49b137106bbfe9947#diff-efa8880e64e22de68f1e34c2f1d5b538
where I was converting the github archives data to ORC for benchmarking.

I've created a jira https://issues.apache.org/jira/browse/ORC-150 for
adding one.

.. Owen

