Posted to user@crunch.apache.org by Ben Watson <be...@gmail.com> on 2015/09/14 15:29:25 UTC

Output Sequence Files into ORC

Hi all,

I'm trying to write a simple converter in Crunch to turn Sequence files
into ORC files. The only examples that I can find for dealing with ORC
files are the tutorial at
http://hortonworks.com/blog/using-orcfile-cascading-apache-crunch/ and then
the discussion at https://issues.apache.org/jira/browse/CRUNCH-450. The
tutorial seems to only show how to output data that's already in ORC
format, which isn't much use for me here.

It would be nice to be able to output ORC files as you can with plain Java
MapReduce
(http://hadoopathome.logdown.com/posts/277986-using-multipleoutputs-with-orc-in-mapreduce):
specifying a Struct, parsing each record into some type of object, and
letting the output do the rest. I've tried to replicate this in Crunch by
writing a MapFn that turns each record into an OrcWritable, but it doesn't
work, and even if it did, I suspect it wouldn't be very efficient.

Is this something that's already possible that I'm missing?

Thanks,

Ben

Re: Output Sequence Files into ORC

Posted by Ben Watson <be...@gmail.com>.
Hi Micah,

Thanks for your help, it's good to see some more examples of ORC in Crunch.
The single ORC record created manually in the test setup is what I needed
to see.

Thanks,

Ben

On Mon, Sep 14, 2015 at 9:50 PM, Micah Whitacre <mk...@gmail.com>
wrote:

> Ben,
>
> You might look at the OrcFileSourceTarget integration tests [1]. I'm not an
> expert on ORC files, but it looks like they have a few examples of reading
> and writing data.
>
> [1] -
> https://github.com/apache/crunch/blob/master/crunch-hive/src/it/java/org/apache/crunch/io/orc/OrcFileSourceTargetIT.java#L64

Re: Output Sequence Files into ORC

Posted by Micah Whitacre <mk...@gmail.com>.
Ben,

You might look at the OrcFileSourceTarget integration tests [1]. I'm not an
expert on ORC files, but it looks like they have a few examples of reading
and writing data.

[1] -
https://github.com/apache/crunch/blob/master/crunch-hive/src/it/java/org/apache/crunch/io/orc/OrcFileSourceTargetIT.java#L64
