Posted to user@crunch.apache.org by Magnus Runesson <ma...@linuxalert.org> on 2014/01/27 18:12:56 UTC

Reading Avro to GenericRecord

Can I read an Avro file into a GenericRecord in (S)crunch without
providing the schema? I want Crunch to get the schema from the Avro file
it reads. How do I do it?

/Magnus

Re: Reading Avro to GenericRecord

Posted by Josh Wills <jw...@cloudera.com>.
Committed this as CRUNCH-334. Thanks Magnus!


On Tue, Jan 28, 2014 at 1:07 AM, Magnus Runesson <ma...@linuxalert.org> wrote:

>  Thanks! Looks like it works for me.


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Reading Avro to GenericRecord

Posted by Magnus Runesson <ma...@linuxalert.org>.
Thanks! Looks like it works for me.

Here is a patch to expose it to scrunch:

diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
index 89b331b..b77b042 100644
--- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
+++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
@@ -19,11 +19,14 @@ package org.apache.crunch.scrunch

 import org.apache.crunch.io.{From => from, To => to, At => at}
 import org.apache.crunch.types.avro.AvroType
-import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.conf.Configuration
+;

 trait From {
   def avroFile[T](path: String, atype: AvroType[T]) = from.avroFile(path, atype)
   def avroFile[T](path: Path, atype: AvroType[T]) = from.avroFile(path, atype)
+  def avroFile[T](path: Path, conf: Configuration) = from.avroFile(path, conf)
   def textFile(path: String) = from.textFile(path)
   def textFile(path: Path) = from.textFile(path)
 }


On 1/28/14 2:04 AM, Josh Wills wrote:
> Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333


Re: Reading Avro to GenericRecord

Posted by Josh Wills <jw...@cloudera.com>.
Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333


On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills <jo...@gmail.com> wrote:

> Of course. I wrote up a little patch that adds a method to From.java to
> open the Avro file and pull out the schema and return a Source of
> GenericData.Record, but I had to roll to some meetings before I got a
> chance to test it. I'll post something later this evening ET.

Re: Reading Avro to GenericRecord

Posted by Josh Wills <jo...@gmail.com>.
Of course. I wrote up a little patch that adds a method to From.java to
open the Avro file and pull out the schema and return a Source of
GenericData.Record, but I had to roll to some meetings before I got a
chance to test it. I'll post something later this evening ET.
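
The idea of opening the Avro file and pulling out the schema works because an Avro container file stores its writer schema as JSON in the file header, under the metadata key avro.schema. The following is a minimal stdlib-only Java sketch of that step, not the code from the Crunch patch; the class and method names (AvroHeaderSchema, readSchemaJson) are hypothetical. In real code Avro's own DataFileReader or DataFileStream, whose getSchema() returns the parsed schema, would normally do this work.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: extracts the writer schema (as JSON text) from an Avro container file header. */
final class AvroHeaderSchema {
    // Every Avro container file starts with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Reads one Avro long: variable-length, zigzag-encoded. */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1); // undo the zigzag encoding
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string" share this layout). */
    static byte[] readBytes(DataInputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) throw new IOException("bad length " + len);
        byte[] buf = new byte[(int) len];
        in.readFully(buf);
        return buf;
    }

    /** Returns the value of the "avro.schema" metadata entry, or null if it is absent. */
    static String readSchemaJson(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro container file");
        String schema = null;
        // The metadata map is a series of blocks, terminated by a block with count 0.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {   // negative count: the block's byte size follows, then |count| entries
                readLong(in);
                count = -count;
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.schema")) schema = new String(value, StandardCharsets.UTF_8);
            }
        }
        return schema;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(readSchemaJson(in));
        }
    }
}
```

Running the class with the path of a local .avro file as its only argument prints the schema JSON. Only the header is read, so the cost does not depend on how many records the file holds.
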
On Jan 27, 2014 11:56 AM, "Magnus Runesson" <ma...@linuxalert.org> wrote:

>  Thanks for the quick answer.
>
> It is totally OK and reasonable to take one file in a directory and assume
> all the others have the same schema.

Re: Reading Avro to GenericRecord

Posted by Magnus Runesson <ma...@linuxalert.org>.
Thanks for the quick answer.

It is totally OK and reasonable to take one file in a directory and
assume all the others have the same schema.


On 2014-01-27 18:27, Josh Wills wrote:
> No, I haven't written a way to do that yet, and I feel bad about it-- 
> a Clouderan asked me for just such a feature a couple of weeks ago and 
> it slipped my mind. I don't think it's hard to do, just a little 
> tedious and will require refreshing my memory of the Avro APIs. 
> There's also the potential issue that multiple Avro files in the same 
> input directory can have different schemas, so the one we would end up 
> reading might be somewhat arbitrary (e.g., based on the timestamp of 
> the files in the directory, or some such thing)-- is that ok?


Re: Reading Avro to GenericRecord

Posted by Josh Wills <jo...@gmail.com>.
No, I haven't written a way to do that yet, and I feel bad about it-- a
Clouderan asked me for just such a feature a couple of weeks ago and it
slipped my mind. I don't think it's hard to do, just a little tedious and
will require refreshing my memory of the Avro APIs. There's also the
potential issue that multiple Avro files in the same input directory can
have different schemas, so the one we would end up reading might be
somewhat arbitrary (e.g., based on the timestamp of the files in the
directory, or some such thing)-- is that ok?
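
One way to make that arbitrary choice at least predictable is to always take the most recently modified file and read the schema from it. The sketch below illustrates that selection policy only and is not taken from Crunch: it uses java.nio.file on a local directory to stay self-contained, whereas Crunch itself works against Hadoop's FileSystem and Path. The names SchemaSourcePicker and newestAvroFile, and the "*.avro" glob, are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Optional;

/** Hypothetical helper: chooses which file in an input directory supplies the schema. */
final class SchemaSourcePicker {

    /** Returns the most recently modified *.avro file directly inside dir, if there is one. */
    static Optional<Path> newestAvroFile(Path dir) throws IOException {
        Path newest = null;
        FileTime newestTime = null;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.avro")) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) continue; // skip subdirectories
                FileTime t = Files.getLastModifiedTime(file);
                // Equal timestamps are resolved by path order so the result is deterministic.
                boolean newer = newest == null
                        || t.compareTo(newestTime) > 0
                        || (t.compareTo(newestTime) == 0 && file.compareTo(newest) > 0);
                if (newer) {
                    newest = file;
                    newestTime = t;
                }
            }
        }
        return Optional.ofNullable(newest);
    }
}
```

The result is empty for a directory with no matching files, so the caller has to decide whether that is an error. Files written with a different schema are still read using the chosen file's schema, which is the trade-off raised above.
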


On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson <ma...@linuxalert.org> wrote:

> Can I read an Avro file into a GenericRecord in (S)crunch without
> providing the schema? I want Crunch to get the schema from the Avro file
> it reads. How do I do it?
>
> /Magnus
>