Posted to user@crunch.apache.org by Jonathan Natkins <na...@cloudera.com> on 2012/12/13 19:35:43 UTC

Writing Avro data to files

Out of curiosity, is there a way to write output from a Crunch pipeline
into an Avro-format file? It seems that if you do the
collection.write(To.avroFile(path)), you end up just writing JSON. It can
certainly be read into an Avro object, but it seems like it would be more
efficient to write binary data to the file, so no parsing has to happen.

Have I missed an API, or is this a missing feature?

Thanks,
Natty
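
To make the efficiency point above concrete, here is a minimal sketch of how
Avro's binary encoding lays out a long and a string, next to the same values
as JSON text. It is hand-rolled for illustration only: the class
AvroBinarySketch, its methods, and the {id, name} record shape are
hypothetical and are not part of Crunch or of Avro's API. It also covers only
the datum body; a real Avro data file adds a header carrying the schema plus
sync markers between blocks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of Avro's binary encoding for a long and a string.
// Hypothetical helper, not Crunch or Avro API; real jobs should use Avro's own encoders.
final class AvroBinarySketch {

    // Avro writes int/long values as zig-zag varints, so small magnitudes take few bytes.
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63);    // fold the sign into the low bit
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            zigzag >>>= 7;
        }
        out.write((int) zigzag);                       // final byte, continuation bit clear
    }

    // Avro writes a string as its UTF-8 byte length (encoded as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes a hypothetical record {id: long, name: string}, fields in schema order.
    // No field names or delimiters are written; the schema supplies that structure.
    static byte[] encodeRecord(long id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, id);
        writeString(out, name);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] binary = encodeRecord(1234567L, "natty");
        String json = "{\"id\": 1234567, \"name\": \"natty\"}";
        System.out.println("binary bytes: " + binary.length);
        System.out.println("JSON bytes:   " + json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Because the binary form carries no field names, quotes, or separators, a
reader that already has the schema can decode each field directly instead of
tokenizing and parsing text.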

Re: Writing Avro data to files

Posted by Josh Wills <jw...@cloudera.com>.

I'm optimistic it will. Re: the MemPipeline, I think it almost always
defaults to writing text output, on the assumption that when you're using
the MemPipeline you're debugging stuff, and text is the fastest way to do
that. But that may not be the right thing to do in all cases-- if anyone
has any feedback on this, we'd be grateful for it.


On Thu, Dec 13, 2012 at 11:15 AM, Jonathan Natkins <na...@cloudera.com> wrote:

> Gotcha. Alright, I'll try a true MR pipeline, and see if that improves the
> situation. Thanks!
>
>
> On Thu, Dec 13, 2012 at 11:12 AM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Ah-- that is interesting, and almost certainly the reason why we're
>> writing JSON instead of binary Avro.
>>
>>
>> On Thu, Dec 13, 2012 at 11:08 AM, Jonathan Natkins <na...@cloudera.com> wrote:
>>
>>> It's 2.0.0 and 1.7.0. I've actually only been running MemPipelines thus
>>> far, to make sure that I've built the job correctly, so it's possible that
>>> that's the issue.
>>>
>>>
>>> On Thu, Dec 13, 2012 at 10:56 AM, Josh Wills <jw...@cloudera.com> wrote:
>>>
>>>> That surprises me-- Crunch has its own AvroOutputFormat in order to use
>>>> the mapreduce.* APIs, but they delegate much of the work to things like
>>>> DatumWriters/encoders/etc. from Avro's core libraries.
>>>>
>>>> Could I get some detail on hadoop/avro version? Is it just 1.0.x and
>>>> Avro 1.7.0?
>>>>
>>>> J
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 10:35 AM, Jonathan Natkins <na...@cloudera.com> wrote:
>>>>
>>>>> Out of curiosity, is there a way to write output from a Crunch
>>>>> pipeline into an Avro-format file? It seems that if you do the
>>>>> collection.write(To.avroFile(path)), you end up just writing JSON. It can
>>>>> certainly be read into an Avro object, but it seems like it would be more
>>>>> efficient to write binary data to the file, so no parsing has to happen.
>>>>>
>>>>> Have I missed an API, or is this a missing feature?
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Writing Avro data to files

Posted by Jonathan Natkins <na...@cloudera.com>.

Gotcha. Alright, I'll try a true MR pipeline, and see if that improves the
situation. Thanks!



Re: Writing Avro data to files

Posted by Josh Wills <jw...@cloudera.com>.

Ah-- that is interesting, and almost certainly the reason why we're writing
JSON instead of binary Avro.



Re: Writing Avro data to files

Posted by Jonathan Natkins <na...@cloudera.com>.

It's Hadoop 2.0.0 and Avro 1.7.0. I've actually only been running
MemPipelines thus far, to make sure that I've built the job correctly, so
it's possible that that's the issue.



Re: Writing Avro data to files

Posted by Josh Wills <jw...@cloudera.com>.

That surprises me-- Crunch has its own AvroOutputFormat in order to use the
mapreduce.* APIs, but they delegate much of the work to things like
DatumWriters/encoders/etc. from Avro's core libraries.

Could I get some detail on the Hadoop/Avro versions? Is it just Hadoop 1.0.x
and Avro 1.7.0?

J
