Posted to user@flume.apache.org by Mohammad Tariq <do...@gmail.com> on 2012/06/12 19:51:37 UTC

Unable to dump data into the hdfs

Hello list,

    I am trying to collect Apache web server logs and put them into
HDFS, but I am not able to do it properly; only the first few rows
from the log file are making it into HDFS. My conf file looks like
this -

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = HDFS

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -f /var/log/apache2/access.log.1
agent1.sources.tail.channels = MemoryChannel-2

agent1.sinks.HDFS.channel = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream

agent1.channels.MemoryChannel-2.type = memory
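
A side note on the channel: a memory channel left at its defaults can
back up or drop events under load. A minimal sketch of explicit sizing,
using the Flume NG memory channel's capacity properties (the values here
are illustrative, not recommendations):

agent1.channels.MemoryChannel-2.type = memory
# maximum number of events the channel can hold
agent1.channels.MemoryChannel-2.capacity = 10000
# maximum events per put/take transaction between source/sink and channel
agent1.channels.MemoryChannel-2.transactionCapacity = 1000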

Regards,
    Mohammad Tariq

Re: Unable to dump data into the hdfs

Posted by Mohammad Tariq <do...@gmail.com>.
Thank you so much Eric for pointing out the difference between -F and
-f. I have not tuned the flush/rotation configuration. Also, two files
are getting generated every time I start the agent. Is that normal?
And is there a link where I can find info on agent configuration
(especially for the hbase-sink)? Many thanks.
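
On the hbase-sink point, for reference: the Flume NG user guide documents
per-component properties, and the HBase sink (added around Flume 1.2, so
possibly newer than the build in use here) follows the same pattern as the
HDFS sink. A minimal sketch with hypothetical table and column family
names (both must already exist in HBase):

agent1.sinks.hbase1.type = hbase
agent1.sinks.hbase1.channel = MemoryChannel-2
# hypothetical table and column family names
agent1.sinks.hbase1.table = access_logs
agent1.sinks.hbase1.columnFamily = log
agent1.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer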

Regards,
    Mohammad Tariq



Re: Unable to dump data into the hdfs

Posted by Eric Sammer <es...@cloudera.com>.
Mohammad:

There are a few reasons why this could be.

On Tue, Jun 12, 2012 at 10:51 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello list,
>
>    I am trying to collect Apache web server logs and put them into
> HDFS, but I am not able to do it properly; only the first few rows
> from the log file are making it into HDFS. My conf file looks like
> this -
>
> agent1.sources = tail
> agent1.channels = MemoryChannel-2
> agent1.sinks = HDFS
>
> agent1.sources.tail.type = exec
> agent1.sources.tail.command = tail -f /var/log/apache2/access.log.1
>

You probably want to use tail -F rather than tail -f. The former
re-opens the file by name and so keeps following it across truncation
and rotation, whereas the latter follows the original file descriptor.
Also, I'm not familiar with how your Apache logs are being written, but
access.log.1 is usually a rotated-out (i.e. non-changing) file. Do you
mean to tail access.log instead?
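
Concretely, the source definition would become something like this (a
sketch, assuming access.log is the live file currently being written):

agent1.sources.tail.type = exec
# -F re-opens the file by name, so it survives logrotate
agent1.sources.tail.command = tail -F /var/log/apache2/access.log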

> agent1.sources.tail.channels = MemoryChannel-2
>
> agent1.sinks.HDFS.channel = MemoryChannel-2
> agent1.sinks.HDFS.type = hdfs
> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
> agent1.sinks.HDFS.hdfs.fileType = DataStream
>

The frequency with which you flush the open file handle in HDFS can
affect the rate at which data "appears" in HDFS. If you never flush or
rotate, data appears in HDFS in block-sized increments (e.g. with a
block size of 128MB, data appears in chunks of 128MB as blocks are
completed). Unless data is arriving in significant quantity, or you've
tuned the flush / rotation configuration appropriately, that is likely
why you see so little of it.
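
A sketch of what that tuning looks like on the HDFS sink, using standard
Flume NG hdfs sink properties (the values are illustrative, not
recommendations):

agent1.sinks.HDFS.hdfs.fileType = DataStream
# roll (close the current file and start a new one) every 60 seconds
agent1.sinks.HDFS.hdfs.rollInterval = 60
# 0 disables size-based and event-count-based rolling
agent1.sinks.HDFS.hdfs.rollSize = 0
agent1.sinks.HDFS.hdfs.rollCount = 0
# number of events written before a flush to HDFS
agent1.sinks.HDFS.hdfs.batchSize = 100

The defaults roll files after a small size and event count, which would
also explain several small files appearing as soon as events start
flowing.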


> agent1.channels.MemoryChannel-2.type = memory
>
> Regards,
>     Mohammad Tariq
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com