You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@avro.apache.org by Daniel Rodriguez <df...@gmail.com> on 2014/02/07 21:06:29 UTC
Create Avro from bytes, not by fields
Hi all,
Some context (not an expert Java programmer, and just starting with
AVRO/Flume):
I need to transfer avro files from different servers to HDFS I am trying to
use Flume to do it.
I have a Flume spooldir source (reading the avro files) with an avro sink
and avro sink with a HDFS sink. Like this:
servers | hadoop
spooldir src -> avro sink --------> avro src -> hdfs
When Flume spooldir deserialize the avro files creates an flume event with
two fields: 1) header contains the schema; 2) and in the body field has the
binary Avro record data, not including the schema or the rest of the
container file elements. See the flume docs:
http://flume.apache.org/FlumeUserGuide.html#avro
So the avro sink creates an avro file like this:
{"headers": {"flume.avro.schema.literal":
"{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}"},
"body": {"bytes": "{BYTES}"}}
So now I am trying to write a serializer since flume only includes an
FlumeEvent serializer creating avro files like the one above, not the
original avro files on the servers.
I am almost there, I got the schema from the header field and the bytes
from the body field.
But now I need to create write the AVRO file based on the bytes, not the
values from the fields, I cannot do: r.put("field", "value") since I don't
have the values, just the bytes.
This is the code:
File file = TESTFILE;
DatumReader<GenericRecord> datumReader = new
GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new
DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
user = dataFileReader.next(user);
Map headers = (Map) user.get("headers");
Utf8 schemaHeaderKey = new Utf8("flume.avro.schema.literal");
String schema = headers.get(schemaHeaderKey).toString();
ByteBuffer body = (ByteBuffer) user.get("body");
// Writing...
Schema.Parser parser = new Schema.Parser();
Schema schemaSimpleWrapper = parser.parse(schema);
GenericRecord r = new GenericData.Record(schemaSimpleWrapper);
// NOT SURE WHAT COMES NEXT
}
Is possible to actually create the AVRO files from the value bytes?
I appreciate any help.
Thanks,
Daniel
Re: Create Avro from bytes, not by fields
Posted by Milind Vaidya <ka...@gmail.com>.
I have asked similar question but regarding deserialization of such records
written as Bytes.
Did you try to deserilize them ?
What does your schemaString look like?
Please refer to thread : Avro Byte Blob Ser
De<https://mail-archives.apache.org/mod_mbox/avro-user/201402.mbox/%3cCAGQuZejTTU9Sw2jMsDDUA9_XQeXM2jxEAQNX5O_HAnqABk=0rw@mail.gmail.com%3e>
Thanks
On Fri, Feb 7, 2014 at 7:29 PM, Daniel Rodriguez
<df...@gmail.com>wrote:
> Thanks you Doug!
>
> That was all I needed to make it work.
>
> Just for the record this is the code:
>
> // Writing...
> Schema.Parser parser = new Schema.Parser();
> Schema schema = parser.parse(schemaString);
>
> File outFile = new File("generated.avro");
> DatumWriter<GenericRecord> datumWriter = new
> GenericDatumWriter<GenericRecord>(schema);
> DataFileWriter<GenericRecord> dataFileWriter = new
> DataFileWriter<GenericRecord>(datumWriter);
> dataFileWriter.create(schema, outFile);
> dataFileWriter.appendEncoded(body);
> dataFileWriter.close();
>
> Thanks again!
>
>
> On Feb 7, 2014, at 2:29 PM, Doug Cutting <cu...@apache.org> wrote:
>
> You might use DataFileWriter#appendEncoded:
>
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendEncoded(java.nio.ByteBuffer)
>
> If the body has just single instance of the record then you'd call this
> once. If you have multiple instances then you might change the body to
> have the schema {"type":"array", "items", "bytes"}.
>
> Doug
>
>
> On Fri, Feb 7, 2014 at 12:06 PM, Daniel Rodriguez <
> df.rodriguez143@gmail.com> wrote:
>
>> Hi all,
>>
>> Some context (not an expert Java programmer, and just starting with
>> AVRO/Flume):
>>
>> I need to transfer avro files from different servers to HDFS I am trying
>> to use Flume to do it.
>> I have a Flume spooldir source (reading the avro files) with an avro sink
>> and avro sink with a HDFS sink. Like this:
>>
>> servers | hadoop
>> spooldir src -> avro sink --------> avro src -> hdfs
>>
>> When Flume spooldir deserialize the avro files creates an flume event
>> with two fields: 1) header contains the schema; 2) and in the body field
>> has the binary Avro record data, not including the schema or the rest of
>> the container file elements. See the flume docs:
>> http://flume.apache.org/FlumeUserGuide.html#avro
>>
>> So the avro sink creates an avro file like this:
>>
>> {"headers": {"flume.avro.schema.literal":
>> "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}"},
>> "body": {"bytes": "{BYTES}"}}
>>
>> So now I am trying to write a serializer since flume only includes an
>> FlumeEvent serializer creating avro files like the one above, not the
>> original avro files on the servers.
>>
>> I am almost there, I got the schema from the header field and the bytes
>> from the body field.
>> But now I need to create write the AVRO file based on the bytes, not the
>> values from the fields, I cannot do: r.put("field", "value") since I
>> don't have the values, just the bytes.
>>
>> This is the code:
>>
>> File file = TESTFILE;
>>
>> DatumReader<GenericRecord> datumReader = new
>> GenericDatumReader<GenericRecord>();
>> DataFileReader<GenericRecord> dataFileReader = new
>> DataFileReader<GenericRecord>(file, datumReader);
>> GenericRecord user = null;
>> while (dataFileReader.hasNext()) {
>> user = dataFileReader.next(user);
>>
>> Map headers = (Map) user.get("headers");
>>
>> Utf8 schemaHeaderKey = new Utf8("flume.avro.schema.literal");
>> String schema = headers.get(schemaHeaderKey).toString();
>>
>> ByteBuffer body = (ByteBuffer) user.get("body");
>>
>>
>> // Writing...
>> Schema.Parser parser = new Schema.Parser();
>> Schema schemaSimpleWrapper = parser.parse(schema);
>> GenericRecord r = new GenericData.Record(schemaSimpleWrapper);
>>
>> // NOT SURE WHAT COMES NEXT
>> }
>>
>> Is possible to actually create the AVRO files from the value bytes?
>>
>> I appreciate any help.
>>
>> Thanks,
>> Daniel
>>
>
>
>
Re: Create Avro from bytes, not by fields
Posted by Daniel Rodriguez <df...@gmail.com>.
Thanks you Doug!
That was all I needed to make it work.
Just for the record this is the code:
// Writing...
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(schemaString);
File outFile = new File("generated.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, outFile);
dataFileWriter.appendEncoded(body);
dataFileWriter.close();
Thanks again!
On Feb 7, 2014, at 2:29 PM, Doug Cutting <cu...@apache.org> wrote:
> You might use DataFileWriter#appendEncoded:
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendEncoded(java.nio.ByteBuffer)
>
> If the body has just single instance of the record then you'd call this once. If you have multiple instances then you might change the body to have the schema {"type":"array", "items", "bytes"}.
>
> Doug
>
>
> On Fri, Feb 7, 2014 at 12:06 PM, Daniel Rodriguez <df...@gmail.com> wrote:
> Hi all,
>
> Some context (not an expert Java programmer, and just starting with AVRO/Flume):
>
> I need to transfer avro files from different servers to HDFS I am trying to use Flume to do it.
> I have a Flume spooldir source (reading the avro files) with an avro sink and avro sink with a HDFS sink. Like this:
>
> servers | hadoop
> spooldir src -> avro sink --------> avro src -> hdfs
>
> When Flume spooldir deserialize the avro files creates an flume event with two fields: 1) header contains the schema; 2) and in the body field has the binary Avro record data, not including the schema or the rest of the container file elements. See the flume docs: http://flume.apache.org/FlumeUserGuide.html#avro
>
> So the avro sink creates an avro file like this:
>
> {"headers": {"flume.avro.schema.literal": "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}"}, "body": {"bytes": "{BYTES}"}}
>
> So now I am trying to write a serializer since flume only includes an FlumeEvent serializer creating avro files like the one above, not the original avro files on the servers.
>
> I am almost there, I got the schema from the header field and the bytes from the body field.
> But now I need to create write the AVRO file based on the bytes, not the values from the fields, I cannot do: r.put("field", "value") since I don't have the values, just the bytes.
>
> This is the code:
>
> File file = TESTFILE;
>
> DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
> DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
> GenericRecord user = null;
> while (dataFileReader.hasNext()) {
> user = dataFileReader.next(user);
>
> Map headers = (Map) user.get("headers");
>
> Utf8 schemaHeaderKey = new Utf8("flume.avro.schema.literal");
> String schema = headers.get(schemaHeaderKey).toString();
>
> ByteBuffer body = (ByteBuffer) user.get("body");
>
>
> // Writing...
> Schema.Parser parser = new Schema.Parser();
> Schema schemaSimpleWrapper = parser.parse(schema);
> GenericRecord r = new GenericData.Record(schemaSimpleWrapper);
>
> // NOT SURE WHAT COMES NEXT
> }
>
> Is possible to actually create the AVRO files from the value bytes?
>
> I appreciate any help.
>
> Thanks,
> Daniel
>
Re: Create Avro from bytes, not by fields
Posted by Doug Cutting <cu...@apache.org>.
You might use DataFileWriter#appendEncoded:
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendEncoded(java.nio.ByteBuffer)
If the body has just single instance of the record then you'd call this
once. If you have multiple instances then you might change the body to
have the schema {"type":"array", "items", "bytes"}.
Doug
On Fri, Feb 7, 2014 at 12:06 PM, Daniel Rodriguez <df.rodriguez143@gmail.com
> wrote:
> Hi all,
>
> Some context (not an expert Java programmer, and just starting with
> AVRO/Flume):
>
> I need to transfer avro files from different servers to HDFS I am trying
> to use Flume to do it.
> I have a Flume spooldir source (reading the avro files) with an avro sink
> and avro sink with a HDFS sink. Like this:
>
> servers | hadoop
> spooldir src -> avro sink --------> avro src -> hdfs
>
> When Flume spooldir deserialize the avro files creates an flume event with
> two fields: 1) header contains the schema; 2) and in the body field has the
> binary Avro record data, not including the schema or the rest of the
> container file elements. See the flume docs:
> http://flume.apache.org/FlumeUserGuide.html#avro
>
> So the avro sink creates an avro file like this:
>
> {"headers": {"flume.avro.schema.literal":
> "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}"},
> "body": {"bytes": "{BYTES}"}}
>
> So now I am trying to write a serializer since flume only includes an
> FlumeEvent serializer creating avro files like the one above, not the
> original avro files on the servers.
>
> I am almost there, I got the schema from the header field and the bytes
> from the body field.
> But now I need to create write the AVRO file based on the bytes, not the
> values from the fields, I cannot do: r.put("field", "value") since I
> don't have the values, just the bytes.
>
> This is the code:
>
> File file = TESTFILE;
>
> DatumReader<GenericRecord> datumReader = new
> GenericDatumReader<GenericRecord>();
> DataFileReader<GenericRecord> dataFileReader = new
> DataFileReader<GenericRecord>(file, datumReader);
> GenericRecord user = null;
> while (dataFileReader.hasNext()) {
> user = dataFileReader.next(user);
>
> Map headers = (Map) user.get("headers");
>
> Utf8 schemaHeaderKey = new Utf8("flume.avro.schema.literal");
> String schema = headers.get(schemaHeaderKey).toString();
>
> ByteBuffer body = (ByteBuffer) user.get("body");
>
>
> // Writing...
> Schema.Parser parser = new Schema.Parser();
> Schema schemaSimpleWrapper = parser.parse(schema);
> GenericRecord r = new GenericData.Record(schemaSimpleWrapper);
>
> // NOT SURE WHAT COMES NEXT
> }
>
> Is possible to actually create the AVRO files from the value bytes?
>
> I appreciate any help.
>
> Thanks,
> Daniel
>