Posted to user@flume.apache.org by Tobias Schlottke <to...@gmail.com> on 2011/10/25 18:02:56 UTC

Logging AVRO-Binary directly?

Hi there,

sorry for the newbie question.
I really want to write logging data in a custom Avro schema.
Is it possible to extend the standard schema?
Or would I use the raw output format, logging serialized Avro data in
the message body, and analyze it later in Hadoop?
Are there any problems with this? I could imagine that this won't work
because Hadoop splits files into 64 MB blocks?
Do we have to implement a custom source?
What is the most elegant solution for this?

Best,

Tobias

Re: Logging AVRO-Binary directly?

Posted by Mingjie Lai <mj...@gmail.com>.
Sorry for the late reply. Please see my responses inline.

> So you would log the raw Avro data in the message body?

Yes. I thought you only use avro to encode log messages, right? You 
won't use it for RPC, right?

> I thought this would be a problem.
> I mean flume has to know how to handle the data, or am I wrong?

I would use Flume just as a ``pipe'', which is only responsible 
for moving data. It doesn't have to understand the content of the data.

> Filtering would be nice too, but that's not a hard requirement at the
> beginning.

If you want to do filtering, it would be a different story, since a 
decorator has to understand the content of the data. But you still 
don't need an Avro source that knows the schema.

> As far as I can see avro is not meant to encode single messages but rather
> files/streams.

AFAIK, it's not only for encoding single messages. It can be used to 
encode files, RPC, etc.

 > So there seems to be no way to encode a single binary
 > message into a string.

I don't quite understand. Flume and Avro both handle data in binary. You 
don't need to worry about strings, right?

> I think it is this way because the first packet holds all the meta
> information (delimiters, schema).

> As far as I understand this jazz, there should be an Avro source which
> understands my format (and is compatible with the original Avro format's
> structure) to decode the messages into Flume.
> But there is no Avro source I can pass a schema to.

As I mentioned before, Flume can just treat the Avro messages as byte 
messages, and put them somewhere (HDFS?) for further analysis.

-Mingjie

Re: Logging AVRO-Binary directly?

Posted by Tobias Schlottke <to...@gmail.com>.
Hey there,

> It sounds like your source is in avro? Or you want to transform your logs to avro?
> 
We're creating Avro objects in our application and want to push them into some sort of log4j-ish sink.

> > Or would I use the raw output format, logging serialized Avro data in
> > the message body and analyze it later in Hadoop?
> 
> I don't see any problem.
So you would log the raw Avro data in the message body?
I thought this would be a problem.
I mean, Flume has to know how to handle the data, or am I wrong?
Filtering would be nice too, but that's not a hard requirement at the beginning.
As far as I can see, Avro is not meant to encode single messages but rather files/streams. So there seems to be no way to encode a single binary message into a string.
I think it is this way because the first packet holds all the meta information (delimiters, schema).

As far as I understand this jazz, there should be an Avro source which understands my format (and is compatible with the original Avro format's structure) to decode the messages into Flume.
But there is no Avro source I can pass a schema to.

Maybe this is what I want?
https://issues.apache.org/jira/browse/FLUME-776


> 
> > Are there any problems with this? I could imagine that this won't work
> > because Hadoop splits files into 64 MB blocks?
> 
> HDFS block size should be transparent to users. You wouldn't be aware of it at all. If you write Avro to HDFS, I can imagine that later on you will parse the Avro file(s) with a map/reduce job and do whatever you want. I don't see why you need to bother about the 64 MB block size. Or am I missing something?
> 
> Is the link helpful?
> http://www.datasalt.com/blog/2011/07/hadoop-avro/

I've tested parsing the data with Pig, which seems to be no problem. This blog entry is very helpful, thanks!
I thought that block creation was "\n"-aware, so that no line/packet gets cut in the middle. Which would be rather funny, though.
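For what it's worth, Avro's object container format doesn't rely on "\n" at all: each block of records is followed by a 16-byte sync marker, so a reader starting at an arbitrary byte offset (say, an HDFS block boundary) scans forward to the next marker and resumes there. A simplified sketch of that mechanism (real Avro stores a random per-file marker in the file header, making collisions with record bytes vanishingly unlikely; the fixed marker here is for illustration only):

```python
# Fixed 16-byte marker for the sketch; real Avro generates a random one
# per file and records it in the header
SYNC = b"\x00SYNCMARKERDEMO\x00"

def write_blocks(blocks):
    # Each block of serialized records is followed by the sync marker
    out = bytearray()
    for block in blocks:
        out += block + SYNC
    return bytes(out)

def read_from_offset(data, offset):
    # Resume mid-file: skip to the next sync marker, then read whole
    # blocks from there (the reader of the previous split handles the
    # record that straddles the boundary)
    start = data.find(SYNC, offset)
    if start < 0:
        return []
    start += len(SYNC)
    blocks = []
    while True:
        end = data.find(SYNC, start)
        if end < 0:
            return blocks
        blocks.append(data[start:end])
        start = end + len(SYNC)
```

So a map task pointed at the middle of a record simply skips to the next sync point; nothing gets cut in a way the reader can't recover from.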

Best,

Tobias

Re: Logging AVRO-Binary directly?

Posted by Mingjie Lai <mj...@gmail.com>.
It sounds like your source is in avro? Or you want to transform your 
logs to avro?

 > Or would I use the raw output format, logging serialized Avro data in
 > the message body and analyze it later in Hadoop?

I don't see any problem.

 > Are there any problems with this? I could imagine that this won't work
 > because Hadoop splits files into 64 MB blocks?

HDFS block size should be transparent to users. You wouldn't be aware of 
it at all. If you write Avro to HDFS, I can imagine that later on you 
will parse the Avro file(s) with a map/reduce job and do whatever you 
want. I don't see why you need to bother about the 64 MB block size. Or 
am I missing something?

Is the link helpful?
http://www.datasalt.com/blog/2011/07/hadoop-avro/



On 10/25/2011 09:02 AM, Tobias Schlottke wrote:
> Hi there,
>
> sorry for the newbie question.
> I really want to write logging data in a custom Avro schema.
> Is it possible to extend the standard schema?
> Or would I use the raw output format, logging serialized Avro data in
> the message body, and analyze it later in Hadoop?
> Are there any problems with this? I could imagine that this won't work
> because Hadoop splits files into 64 MB blocks?
> Do we have to implement a custom source?
> What is the most elegant solution for this?
>
> Best,
>
> Tobias
>