Posted to user@avro.apache.org by Gaurav <ga...@gmail.com> on 2011/12/05 16:33:03 UTC

Decode without using DataFileReader

Hi,

I am trying to read a byte stream of encoded data that comes from a source
other than a file, so I cannot use DataFileReader.

I wrote the following code to do that, but here I have to specify the schema
myself, when ideally it should come from the data itself. Is there any other
way to decode the data without explicitly specifying the schema and without
using DataFileReader?
----------------------
	private static void DecodeData(byte[] buf) throws IOException {
		Schema schema = createSchema();
		GenericDatumReader<GenericData.Record> datum =
				new GenericDatumReader<GenericData.Record>(schema);

		ByteArrayInputStream in = new ByteArrayInputStream(buf);
		BinaryDecoder decoder = DECODER_FACTORY.binaryDecoder(in, null);

		GenericData.Record record = new GenericData.Record(datum.getSchema());
		datum.read(record, decoder);

		System.out.println(record.get("trade"));
	}
---------------------

Thanks,
Gaurav Nanda

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Decode-without-using-DataFileReader-tp3561722p3561722.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: Decode without using DataFileReader

Posted by Harsh J <ha...@cloudera.com>.
I do not understand what you're trying to achieve here.

Encoders work at the primitive level: they merely serialize the data structure they are given (records or unions, for example) and do not look at the schema. (Notice that you create a record with a schema, not an encoder with a schema.) Decoders could do the same and read back primitives, but only when given a schema can they read back properly structured data. Since encoders do not store the schema, decoders need it supplied externally.

DataFiles solve this for you by writing the schema itself into the file as a header. The reader loads this schema into the decoder when it reads the file back.
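
To make the header idea concrete, here is a minimal sketch that uses only the JDK. It is not Avro's wire format or container layout; the class name, the text "schema" syntax, and the helper methods are made up for illustration (only the field names abc and inner_abc are borrowed from this thread). The payload carries bare values with no names or type tags, so the reader can only interpret them because the schema text was written once up front.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration only: this is NOT Avro's binary encoding or file format.
class SchemaHeaderSketch {

    // Hypothetical "schema": field order and types, written as plain text.
    static final String SCHEMA = "abc:int,inner_abc:long";

    // Write the schema once as a header, then the bare field values.
    static byte[] write(int abc, long innerAbc) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(SCHEMA);      // header: length-prefixed schema text
        data.writeInt(abc);         // payload: 4 bytes, no name or type tag
        data.writeLong(innerAbc);   // payload: 8 bytes, no name or type tag
        data.flush();
        return out.toByteArray();
    }

    // Read the header first, then use it to interpret the payload.
    static String read(byte[] buf) throws IOException {
        DataInputStream data = new DataInputStream(new ByteArrayInputStream(buf));
        String schema = data.readUTF();
        StringBuilder decoded = new StringBuilder();
        for (String field : schema.split(",")) {
            String[] nameAndType = field.split(":");
            // The type named in the header decides how many bytes to consume.
            String value = nameAndType[1].equals("int")
                    ? String.valueOf(data.readInt())
                    : String.valueOf(data.readLong());
            decoded.append(nameAndType[0]).append('=').append(value).append(';');
        }
        return decoded.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(1050324, 23490843L)));
    }
}
```

Running main prints abc=1050324;inner_abc=23490843; and without the header line the reader would have no way to tell a 4-byte int followed by an 8-byte long from any other 12-byte layout.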



Re: Decode without using DataFileReader

Posted by Gaurav <ga...@gmail.com>.
> it makes no sense for the encoder to store the schema for every given
> record in the stream.

Agreed. It's not even the encoder's or decoder's job to store the schema.

While writing data, I noticed that we don't even need DataFileWriter; all we
need is a GenericDatumWriter, an Encoder, and any kind of output stream (which
can also be a file output stream).

Sample:
------------------------------------------------
	private static ByteArrayOutputStream EncodeData() throws IOException {
		Schema schema = createMetaData();

		GenericDatumWriter<GenericData.Record> datum =
				new GenericDatumWriter<GenericData.Record>(schema);

		// Build the nested "trade" record from the schema of that field.
		GenericData.Record inner_record =
				new GenericData.Record(schema.getField("trade").schema());
		inner_record.put("inner_abc", 23490843L);

		GenericData.Record record = new GenericData.Record(schema);
		record.put("abc", 1050324);
		record.put("trade", inner_record);

		ByteArrayOutputStream out = new ByteArrayOutputStream();
		BinaryEncoder encoder = ENCODER_FACTORY.binaryEncoder(out, null);

		datum.write(record, encoder);

		// The encoder buffers internally, so flush before using the bytes.
		encoder.flush();
		out.close();

		return out;
	}
------------------------------------------------

Then why can't I just use the same stream to read back the metadata and the
data? It should not be the responsibility of the stream reader (which in
this case is DataFileReader) to parse out the schema.

Thanks,
Gaurav Nanda

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Decode-without-using-DataFileReader-tp3561722p3562127.html

Re: Decode without using DataFileReader

Posted by Harsh J <ha...@cloudera.com>.
The data file format stores the schema as part of its header.
That's one of its advantages.

The encoder and decoder are lower-level and do not do that. You need to
manage the schema yourself if you choose to use the encoder/decoder
instead of the data file format (why?). The source stream can't contain
the schema if you do not store it there, and it makes no sense for the
encoder to store the schema for every given record in the stream.
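
A minimal sketch of using the data file format without any file, assuming Avro's Java library (1.5 or later) is on the classpath. DataFileWriter.create(Schema, OutputStream) writes the header to any output stream, and DataFileStream reads the container back from any InputStream, taking the schema from the header. The class name, the Example schema, and the helper methods are invented for illustration; only the field name abc comes from this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

class ContainerOverStream {

    // Hypothetical schema, known to the writer only.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
        + "{\"name\":\"abc\",\"type\":\"int\"}]}");

    // Writer side: header (including the schema) plus one record, to any OutputStream.
    static byte[] write(int abc) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("abc", abc);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.create(SCHEMA, out);
        writer.append(record);
        writer.close();
        return out.toByteArray();
    }

    // Reader side: no schema is passed in; DataFileStream takes it from the header.
    static String readFirst(InputStream in) throws IOException {
        DataFileStream<GenericRecord> stream =
            new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
        try {
            GenericRecord record = stream.next();
            return stream.getSchema().getName() + ":" + record.get("abc");
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = write(1050324);
        System.out.println(readFirst(new ByteArrayInputStream(bytes)));
    }
}
```

The byte arrays stand in for whatever transport carries the data; the reader never sees SCHEMA directly and still recovers the record name and field value.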




-- 
Harsh J

Re: Decode without using DataFileReader

Posted by Gaurav Nanda <ga...@gmail.com>.
I guess I did not put it the right way.

See this sample code:
---------------------------------------
public static void testRead(File file) throws IOException {
    GenericDatumReader<GenericData.Record> datum =
        new GenericDatumReader<GenericData.Record>();
    DataFileReader<GenericData.Record> reader =
        new DataFileReader<GenericData.Record>(file, datum);

    GenericData.Record record = new GenericData.Record(reader.getSchema());
    while (reader.hasNext()) {
      // next(reuse) returns the populated record, which may or may not be the reused instance.
      record = reader.next(record);
      System.out.println("Name " + record.get("name") + " Age " + record.get("age"));
    }

    reader.close();
  }
-------------------------------
This takes a file as input, which contains both the schema and the actual data.
In my case, instead of a file, I have some other stream of schema and data,
which I am passing to the DecodeData() function.

So the question now is: how do I extract the schema from there?

Thanks,
Gaurav Nanda


Re: Decode without using DataFileReader

Posted by Matt Stevenson <ma...@gmail.com>.
No, the schema needs to be present in some form to tell the reader how to
decode the data.
You can generate classes from the schema and pass in the class, but that is
just a different way of passing in the schema.
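
A minimal sketch of what "present in some form" can look like with the generic API, assuming Avro's Java library (1.5 or later) is on the classpath: both sides parse the same schema text, agreed on out of band, and the encoded bytes carry only the field values. The class name, the Example schema, and the helper methods are hypothetical; the field names are borrowed from this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

class ExternalSchemaRoundTrip {

    // Hypothetical schema text shared by writer and reader (config, registry, etc.).
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
        + "{\"name\":\"abc\",\"type\":\"int\"},"
        + "{\"name\":\"inner_abc\",\"type\":\"long\"}]}";

    // Writer side: the schema drives the datum writer; the encoder emits values only.
    static byte[] encode(Schema schema, int abc, long innerAbc) throws IOException {
        GenericRecord record = new GenericData.Record(schema);
        record.put("abc", abc);
        record.put("inner_abc", innerAbc);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Reader side: the same schema must be supplied again to interpret the bytes.
    static GenericRecord decode(Schema schema, byte[] buf) throws IOException {
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(buf, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord record = decode(schema, encode(schema, 1050324, 23490843L));
        System.out.println(record.get("abc") + " " + record.get("inner_abc"));
    }
}
```

Passing a generated class to a specific reader would replace the Schema argument here, but the schema still reaches the reader from outside the encoded bytes.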




-- 
Matt Stevenson.