Posted to user@avro.apache.org by Public Network Services <pu...@gmail.com> on 2013/02/05 20:53:14 UTC

Generic data extraction from an Avro file

Folks,

Assuming an application that only needs to quickly examine the contents of
a bunch of Avro data files (regardless of whether they are binary- or
JSON-encoded, and without any prior knowledge of the schema or object
structure), one approach would be to simply extract the Avro records as
JSON text. A simple way of doing this would be:

   1. Create a DataFileStream<GenericRecord>(FileInputStream,
   GenericDatumReader<GenericRecord>) from a FileInputStream to the file.
   (If the file is not an Avro data file, an IOException is thrown.)
   2. Read GenericRecord records from the DataFileStream object while its
   hasNext() method returns true.
   3. Convert each GenericRecord that is read into a JSON string by
   calling its toString() method.

For the test datasets in the Avro 1.7.3 distribution, this actually works
fine.
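
In code, the three steps could look roughly like this (an untested sketch
against the Avro 1.7.x Java API; the class name is made up and the file
path is read from the command line):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroDump {
    public static void main(String[] args) throws IOException {
        // Hypothetical input: path to an Avro data file.
        File input = new File(args[0]);
        FileInputStream in = new FileInputStream(input);
        // Step 1: the constructor reads the file header and throws an
        // IOException if the input is not an Avro data file.
        DataFileStream<GenericRecord> stream =
            new DataFileStream<GenericRecord>(
                in, new GenericDatumReader<GenericRecord>());
        try {
            // Step 2: iterate while hasNext() returns true.
            while (stream.hasNext()) {
                GenericRecord record = stream.next();
                // Step 3: toString() renders the record as JSON text.
                System.out.println(record.toString());
            }
        } finally {
            stream.close();
        }
    }
}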

My question is: does anyone see any potential problems with the above
logic for (binary- or JSON-encoded) Avro data files? For example, should
the GenericRecord.toString() method always produce a valid JSON string?

Thanks!

Re: Generic data extraction from an Avro file

Posted by Doug Cutting <cu...@apache.org>.
Yes, that should be possible.  A given JsonEncoder instance only works
for a given schema.  And every generic record conforms to a schema.

http://avro.apache.org/docs/current/api/java/org/apache/avro/io/EncoderFactory.html#jsonEncoder(org.apache.avro.Schema, java.io.OutputStream)
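
Something along these lines should work, since DataFileStream exposes the
writer's schema from the file header via getSchema(), so no prior schema
knowledge is needed (a rough, untested sketch; the class name is made up):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class JsonDump {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream(args[0]);
        DataFileStream<GenericRecord> stream =
            new DataFileStream<GenericRecord>(
                in, new GenericDatumReader<GenericRecord>());
        try {
            // The writer schema is embedded in the data file itself.
            Schema schema = stream.getSchema();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
            GenericDatumWriter<GenericRecord> writer =
                new GenericDatumWriter<GenericRecord>(schema);
            // Each record is written as a top-level JSON value.
            while (stream.hasNext()) {
                writer.write(stream.next(), encoder);
            }
            encoder.flush();
            System.out.println(out.toString("UTF-8"));
        } finally {
            stream.close();
        }
    }
}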

Doug

On Tue, Feb 5, 2013 at 3:30 PM, Public Network Services
<pu...@gmail.com> wrote:
> Thanks for the clarification.
>
> Is there any way to use JsonEncoder in the scenario I mentioned, i.e. in
> totally schema-agnostic data extraction from either binary or JSON files?
>

Re: Generic data extraction from an Avro file

Posted by Public Network Services <pu...@gmail.com>.
Thanks for the clarification.

Is there any way to use JsonEncoder in the scenario I mentioned, i.e. in
totally schema-agnostic data extraction from either binary or JSON files?


On Tue, Feb 5, 2013 at 2:58 PM, Doug Cutting <cu...@apache.org> wrote:

> Yes, GenericData.Record#toString() should generate valid Json.  It
> does lose some information, e.g.:
>  - record names; and
>  - the distinction between strings & enum symbols, ints & longs,
> floats & doubles, and maps & records.
>
> JsonEncoder loses less information.  It saves enough information to,
> with the schema, always reconstitute an equivalent object.
>
> Doug

Re: Generic data extraction from an Avro file

Posted by Doug Cutting <cu...@apache.org>.
Yes, GenericData.Record#toString() should generate valid JSON.  It
does lose some information, e.g.:
 - record names; and
 - the distinction between strings & enum symbols, ints & longs,
floats & doubles, and maps & records.

JsonEncoder loses less information.  It saves enough information to,
with the schema, always reconstitute an equivalent object.
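
As a toy illustration of the difference (a hypothetical one-field schema
with an ["int","long"] union; untested sketch):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class ToStringVsJsonEncoder {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema whose single field is an int/long union.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":[\"int\",\"long\"]}]}");

        GenericData.Record record = new GenericData.Record(schema);
        record.put("id", 1L);

        // toString() prints something like {"id": 1} -- the int/long
        // distinction is gone.
        System.out.println(record.toString());

        // JsonEncoder prints something like {"id":{"long":1}} -- the union
        // branch is kept, so the record can be reconstituted exactly when
        // read back with the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericData.Record>(schema).write(record, encoder);
        encoder.flush();
        System.out.println(out.toString("UTF-8"));
    }
}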

Doug

