Posted to user@flume.apache.org by Jeff Lord <jl...@cloudera.com> on 2013/11/01 00:42:19 UTC

Re: HDFS Sink Config Help

Jeremy,

The DataStream fileType will let you write plain text files.
CompressedStream does the same, but runs the stream through a compression codec.
SequenceFile will create sequence files, as you guessed, and you can use
either Text or Writable (bytes) for your data there via hdfs.writeFormat.
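For example, something along these lines (a1 and k1 are just placeholder
agent and sink names, and the path is only an illustration):

  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = /flume/events
  a1.sinks.k1.hdfs.fileType = SequenceFile
  # Text or Writable (bytes) for the sequence file records
  a1.sinks.k1.hdfs.writeFormat = Text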

So Flume is configurable out of the box with regard to the size of your
files. Yes, you are correct that it is better to create files that are at
least the size of a full block.
You can roll your files based on time, size, or number of events, and rolling
on an hourly basis makes perfect sense.
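For an hourly roll you would turn off the size- and count-based triggers
and roll on time only, roughly (same placeholder names as above):

  # Roll a new file every hour; 0 disables that roll trigger
  a1.sinks.k1.hdfs.rollInterval = 3600
  a1.sinks.k1.hdfs.rollSize = 0
  a1.sinks.k1.hdfs.rollCount = 0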

With all that said, we recommend writing to Avro container files, as that
format is the best suited for use across the Hadoop ecosystem.
Avro has many benefits, including support for compression, code
generation, versioning, and schema evolution.
You can do this with Flume by specifying the avro_event type for the
serializer property on your HDFS sink.
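Roughly, that looks like the below (placeholder names again; note the
DataStream fileType, since the serializer takes care of the file format,
and the .avro suffix is just to make the files easy to spot):

  # avro_event writes the whole event (headers and body) as Avro records
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.fileSuffix = .avro
  a1.sinks.k1.serializer = avro_event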

Hope this helps.

-Jeff


On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <je...@gmail.com> wrote:

> Hi everyone.
>
> I'm trying to set up Flume to log into HDFS.  Along the way, Flume
> attaches a number of headers (environment, hostname, etc) that I would also
> like to store with my log messages.  Ideally, I'd like to be able to use
> Hive to query all of this later.  I must also admit to knowing next to
> nothing about HDFS.  That probably doesn't help.  :-P
>
> I'm confused about the HDFS sink configuration.  Specifically, I'm trying
> to understand what these two options do (and how they interact):
>
> hdfs.fileType
> hdfs.writeFormat
>
> File Type:
>
> DataStream - This appears to write the event body, and loses all headers.
>  Correct?
> CompressedStream - I assume just a compressed data stream.
> SequenceFile - I think this is what I want, since it seems to be a
> key/value based thing, which I assume means it will include headers.
>
> Write Format: This seems to only apply for SequenceFile above, but lots of
> Internet examples seem to state otherwise.  I'm also unclear on the
> difference here.  Isn't "Text" just a specific type of "Writable" in HDFS?
>
> Also, I'm unclear on why Flume, by default, seems to be set up to make
> such small HDFS files.  Isn't HDFS designed (and more efficient) when
> storing larger files that are closer to the size of a full block?  I was
> thinking it made more sense to write all log data to a single file, and
> roll that file hourly (or whatever, depending on volume).  Thoughts here?
>
> Thanks a lot.
>
> -- Jeremy
>
>
>

Re: HDFS Sink Config Help

Posted by Jeff Lord <jl...@cloudera.com>.
Yes, definitely use Avro instead of JSON if you can.
HIVE-895 added Avro support to Hive, and pretty much the entire Hadoop
ecosystem supports Avro at this point. The ability to evolve and version
the schema is one of the main benefits.


On Fri, Nov 1, 2013 at 9:50 AM, Jeremy Karlson <je...@gmail.com> wrote:

> Hi Jeff,
>
> Thanks for your suggestions.  My only Flume experience so far is with the
> Elasticsearch sink, which serializes (headers and body) to JSON
> automatically.  I was expecting something similar from the HDFS sink and
> when it didn't do that I started questioning the file format when I should
> have been looking at the serializer.  A misunderstanding on my part.
>
> I just finished serializing to JSON when I saw you suggested Avro.  I'll
> look into that.  Is that what you would use if you were going to query with
> Hive external tables?
>
> Thanks again!
>
> -- Jeremy

Re: HDFS Sink Config Help

Posted by Jeremy Karlson <je...@gmail.com>.
Hi Jeff,

Thanks for your suggestions.  My only Flume experience so far is with the
Elasticsearch sink, which serializes (headers and body) to JSON
automatically.  I was expecting something similar from the HDFS sink and
when it didn't do that I started questioning the file format when I should
have been looking at the serializer.  A misunderstanding on my part.

I just finished serializing to JSON when I saw you suggested Avro.  I'll
look into that.  Is that what you would use if you were going to query with
Hive external tables?

Thanks again!

-- Jeremy

