Posted to user@avro.apache.org by Daniel Rodriguez <df...@gmail.com> on 2014/02/07 21:06:29 UTC

Create Avro from bytes, not by fields

Hi all,

Some context (not an expert Java programmer, and just starting with
AVRO/Flume):

I need to transfer avro files from different servers to HDFS, and I am trying to
use Flume to do it.
I have a Flume spooldir source (reading the avro files) feeding an avro sink,
and an avro source feeding an HDFS sink. Like this:

           servers                      |                  hadoop
spooldir src -> avro sink     -------->       avro src -> hdfs

When the Flume spooldir source deserializes the avro files, it creates a Flume
event with two fields: 1) the headers field contains the schema; 2) the body
field contains the binary Avro record data, not including the schema or the
rest of the container-file elements. See the flume docs:
http://flume.apache.org/FlumeUserGuide.html#avro

So the avro sink creates an avro file like this:

{"headers": {"flume.avro.schema.literal":
"{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}"},
"body": {"bytes": "{BYTES}"}}

So now I am trying to write a serializer, since Flume only includes a
FlumeEvent serializer that creates avro files like the one above, not the
original avro files from the servers.

I am almost there: I get the schema from the headers field and the bytes
from the body field.
But now I need to write the AVRO file based on the bytes, not the values of
the fields; I cannot do r.put("field", "value") since I don't have the
values, just the bytes.

This is the code:

import java.io.File;
import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.util.Utf8;

File file = TESTFILE;

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    user = dataFileReader.next(user);

    // The original file's schema is stored as a string in the headers map.
    Map headers = (Map) user.get("headers");
    Utf8 schemaHeaderKey = new Utf8("flume.avro.schema.literal");
    String schema = headers.get(schemaHeaderKey).toString();

    // The body holds the binary-encoded record data.
    ByteBuffer body = (ByteBuffer) user.get("body");

    // Writing...
    Schema.Parser parser = new Schema.Parser();
    Schema schemaSimpleWrapper = parser.parse(schema);
    GenericRecord r = new GenericData.Record(schemaSimpleWrapper);

    // NOT SURE WHAT COMES NEXT
}

Is it possible to actually create the AVRO files from the value bytes?

I appreciate any help.

Thanks,
Daniel

Re: Create Avro from bytes, not by fields

Posted by Milind Vaidya <ka...@gmail.com>.
I have asked a similar question, but regarding deserialization of such records
written as bytes.
Did you try to deserialize them?
What does your schemaString look like?

Please refer to the thread: Avro Byte Blob Ser De
<https://mail-archives.apache.org/mod_mbox/avro-user/201402.mbox/%3cCAGQuZejTTU9Sw2jMsDDUA9_XQeXM2jxEAQNX5O_HAnqABk=0rw@mail.gmail.com%3e>


Thanks






Re: Create Avro from bytes, not by fields

Posted by Daniel Rodriguez <df...@gmail.com>.
Thank you, Doug!

That was all I needed to make it work.

Just for the record this is the code:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

// Writing...
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(schemaString);

File outFile = new File("generated.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, outFile);
// Append the already-encoded record bytes directly, without decoding them first.
dataFileWriter.appendEncoded(body);
dataFileWriter.close();
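As a quick sanity check of the output (my own sketch, standard library only; the
file name "generated.avro" comes from the snippet above), a valid Avro object
container file always begins with the 4-byte magic "Obj" followed by 0x01:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AvroMagicCheck {
    // Every Avro object container file starts with "Obj" + 0x01 (the format magic).
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    static boolean looksLikeAvroContainer(String path) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        if (bytes.length < MAGIC.length) return false;
        for (int i = 0; i < MAGIC.length; i++) {
            if (bytes[i] != MAGIC[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "generated.avro";
        System.out.println(looksLikeAvroContainer(path));
    }
}
```

This only verifies the header, not the schema or records, but it is a cheap way
to confirm the sink wrote a real container file rather than bare record bytes.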

Thanks again!




Re: Create Avro from bytes, not by fields

Posted by Doug Cutting <cu...@apache.org>.
You might use DataFileWriter#appendEncoded:

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendEncoded(java.nio.ByteBuffer)

If the body has just a single instance of the record then you'd call this
once.  If you have multiple instances then you might change the body to
have the schema {"type": "array", "items": "bytes"}.
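For the multi-record case, the event wrapper schema might then look something
like this (a sketch only; the record name "Event" is an assumption, mirroring
the headers/body event shown earlier in the thread):

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "headers", "type": {"type": "map", "values": "string"}},
    {"name": "body", "type": {"type": "array", "items": "bytes"}}
  ]
}
```

With that shape, each element of the body array would hold one binary-encoded
record, and you would loop over the array calling appendEncoded once per element.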

Doug

