Posted to user@avro.apache.org by Public Network Services <pu...@gmail.com> on 2013/01/17 23:11:19 UTC

Generic Avro Classification and Deserialization

Folks,

I am involved in a project to extract data from a large number of files (to
be provided at some point), in numerous formats, among which are some Avro
files (both binary and JSON-encoded), and thus I am looking for the best
way to tackle this.

One of the things we would (ideally) like to do is auto-classify the data
generically, i.e. read a few lines or bytes off a file and be able to tell
what kind of format it is.

This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
sure how this would be done for Avro.

For one thing, there is the necessity of a Schema, about which the
documentation says that

   - "Avro data is always serialized with its schema. Files that store Avro
   data should always also include the schema for that data in the same file."

However, the Java code examples posted on the project website imply that
the Schema is supplied as a separate file and I am not sure whether this is
only the case with RPC.

Are there any code examples for detecting the encoding format (binary/json)
of the data file, assessing whether there is a schema embedded in it and
extracting that schema?

Thanks!

Re: Generic Avro Classification and Deserialization

Posted by Public Network Services <pu...@gmail.com>.
You mean "Avro binary files", yes?

What about Avro JSON files? Would there be a trick to assess whether such a
file is Avro and not generic JSON?


On Fri, Jan 18, 2013 at 10:49 AM, Miki Tebeka <mi...@gmail.com> wrote:

> Avro data files have a "magic" prefix of "Obj\x01" (the bytes 'O', 'b',
> 'j', 0x01), this might help.
> The schema is always embedded in the avro file header, in the metadata
> map under the "avro.schema" key.

Re: Generic Avro Classification and Deserialization

Posted by Miki Tebeka <mi...@gmail.com>.
Avro data files have a "magic" prefix of "Obj\x01" (the bytes 'O', 'b',
'j', 0x01), this might help.
The schema is always embedded in the avro file header, in the metadata
map under the "avro.schema" key.
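
A minimal sketch of that check in Python, using only the standard library.
It assumes the object container file layout from the Avro spec (4-byte magic,
then a metadata map of string keys to bytes values, then a 16-byte sync
marker). The function names are made up for illustration and are not part of
any Avro library:

```python
import io
import json

# Avro object container file magic: 'O', 'b', 'j', 0x01
MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of data while reading a long")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def read_header_metadata(stream):
    """Return the header metadata (str -> bytes), or None if the magic is absent."""
    if stream.read(4) != MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def extract_schema(stream):
    """Return the embedded schema as parsed JSON, or None if there is none."""
    meta = read_header_metadata(stream)
    if meta is None or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

To try it on a file, open it in binary mode, e.g.
extract_schema(open(path, "rb")), where path is whatever file you want to
classify. A None result only means the file is not an Avro binary data file;
it says nothing about JSON-encoded data.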



Re: Generic Avro Classification and Deserialization

Posted by Miki Tebeka <mi...@gmail.com>.
On Fri, Jan 18, 2013 at 4:46 PM, Public Network Services <
publicnetworkservices@gmail.com> wrote:

> I am trying to find sample Avro files


There are some in the Avro source tree test directory.

Re: Generic Avro Classification and Deserialization

Posted by Public Network Services <pu...@gmail.com>.
Thanks for the help!

I am trying to find sample Avro files and it turns out to be surprisingly
difficult (at least via the Google searches I tried).

Would you know of any such files (preferably large-ish) that are openly
available?


On Fri, Jan 18, 2013 at 6:53 AM, Terry Healy <th...@bnl.gov> wrote:

> Check out avro-tools. With this you can dump the schema for a file,
> extract the metadata, or export it in several formats:
>

Re: Generic Avro Classification and Deserialization

Posted by Terry Healy <th...@bnl.gov>.
Check out avro-tools. With this you can dump the schema for a file,
extract the metadata, or export it in several formats:

----------------
Available tools:
      compile  Generates Java code for the given schema.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
      recodec  Alters the codec of a data file.
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, one record per line.
       totext  Converts an Avro data file to a text file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
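
If you want to drive this from a script, here is a rough sketch in Python. It
assumes Java is on the PATH and that you have downloaded the avro-tools jar.
The jar path and the helper names are placeholders, not anything shipped
with Avro:

```python
import subprocess


def avro_tools_command(jar_path, tool, data_file):
    """Build the command line for one avro-tools invocation."""
    return ["java", "-jar", jar_path, tool, data_file]


def get_schema(jar_path, data_file):
    """Run the getschema tool and return whatever it prints on stdout."""
    result = subprocess.run(
        avro_tools_command(jar_path, "getschema", data_file),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the tool fails
    )
    return result.stdout
```

Swapping "getschema" for "getmeta" or "tojson" should work the same way,
since those tools also take the data file as their argument.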

-Terry
