Posted to user@flume.apache.org by DSuiter RDX <ds...@rdx.com> on 2013/10/07 18:00:14 UTC

Problem aggregating syslogTCP > avro > HDFS

Hi, this may be a problem with our understanding, or my configuration.

I am trying to take data from rsyslog, via remote forwarding over TCP, into a
syslogTCP source, send it on through an avro sink to an avro source on a
second agent, and then into an HDFS sink.

Everything is connected and the data is flowing from the remote source into
HDFS in an avro container, so that is not the problem.

The problem is that it is closing files when they are very small, only KBs
in size, even though I have the hdfs.rollInterval and hdfs.rollCount
properties set to 0. I set the hdfs.rollSize property to 3072 for 3MB. I
expected it to aggregate the events into larger files before closing them.
Is this happening because the HDFS directory-building escape sequences are
forcing new directory writes and creating new files prematurely?

Here are my agent configs:

syslogTCP Source > Avro Sink (first tier, pretty sure everything is ok here
but maybe not)

####RT Listener Agent####
rtlv1.sources=srclv1
rtlv1.sinks=snklv1
rtlv1.channels=chnlv1

#sources
rtlv1.sources.srclv1.type=syslogtcp
rtlv1.sources.srclv1.host=192.168.1.2
rtlv1.sources.srclv1.port=5140
rtlv1.sources.srclv1.channels=chnlv1

#channels
rtlv1.channels.chnlv1.type=memory
rtlv1.channels.chnlv1.capacity=1500
rtlv1.channels.chnlv1.transactionCapacity=1500

#sinks
rtlv1.sinks.snklv1.type=avro
rtlv1.sinks.snklv1.hostname=192.168.1.2
rtlv1.sinks.snklv1.port=5141
rtlv1.sinks.snklv1.batch-size=1500
rtlv1.sinks.snklv1.channel=chnlv1

Avro Source > HDFS (second tier)

####RT Aggregate Writer Agent####
rtlv2.sources=srclv2
rtlv2.sinks=snklv2
rtlv2.channels=chnlv2

#sources
rtlv2.sources.srclv2.type=avro
rtlv2.sources.srclv2.bind=192.168.1.2
rtlv2.sources.srclv2.port=5141
rtlv2.sources.srclv2.channels=chnlv2

#channels
rtlv2.channels.chnlv2.type=memory
rtlv2.channels.chnlv2.capacity=1500
rtlv2.channels.chnlv2.transactionCapacity=1500

#sinks
rtlv2.sinks.snklv2.type=hdfs
rtlv2.sinks.snklv2.channel=chnlv2
rtlv2.sinks.snklv2.hdfs.path=/user/flume/avro/%y-%m-%d/%H%M
rtlv2.sinks.snklv2.hdfs.fileSuffix=.avro
rtlv2.sinks.snklv2.serializer=avro_event
rtlv2.sinks.snklv2.hdfs.fileType=DataStream
rtlv2.sinks.snklv2.hdfs.rollInterval=0
rtlv2.sinks.snklv2.hdfs.rollSize=3072
rtlv2.sinks.snklv2.hdfs.batchSize=1500
rtlv2.sinks.snklv2.hdfs.rollCount=0
rtlv2.sinks.snklv2.hdfs.round=true
rtlv2.sinks.snklv2.hdfs.roundValue=10
rtlv2.sinks.snklv2.hdfs.roundUnit=minute
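
One thing worth noting about the config above: with hdfs.round=true,
hdfs.roundValue=10, and %H%M in the path, the sink buckets events into a
new directory every 10 minutes, and each bucket gets its own file. So even
with the roll triggers disabled, a file can never accumulate more than
about 10 minutes of events. Assuming the default hdfs.filePrefix
(FlumeData), successive buckets would look something like:

/user/flume/avro/13-10-07/1750/FlumeData.<counter>.avro
/user/flume/avro/13-10-07/1800/FlumeData.<counter>.avro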

Thanks!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

Re: Problem aggregating syslogTCP > avro > HDFS

Posted by DSuiter RDX <ds...@rdx.com>.
Ok, I just realized that hdfs.rollSize is measured in bytes, so 3072 is
3 KB, not 3 MB. It is probably doing exactly what it is supposed to, since
I told it to close the file at 3 KB...

Sorry everyone!

Thanks!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

