Posted to user@flume.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2014/04/01 17:02:29 UTC
Re: preserve syslog header in hdfs sink
Thanks for the tip! I was indeed missing the interceptors. I've added
them now, but the timestamp and hostname are still not showing up in the
hdfs log. Any advice?
------- sample event in HDFS ------
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??????cc?c??I?[???\?????`?????E?????Tsu[28432]:
pam_unix(su:session): session opened for user root by myuser(uid=31043)
------ same event in syslog ------
Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session): session
opened for user root by myuser(uid=31043)
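The leading "SEQ!org.apache.hadoop.io.LongWritable..." bytes in the HDFS sample above are the Hadoop SequenceFile magic and header: the events are reaching HDFS, but the sink is still wrapping them in its default SequenceFile container, which is exactly what Jeff diagnoses further down the thread. A quick way to confirm this from the shell (the path here is illustrative, patterned on the listing in the original message; every SequenceFile begins with the literal bytes "SEQ"):
------- checking the container format ------
flume@hadoop-t1:~$ hadoop fs -cat /opt/logs/hadoop-t1/2014-04-01/FlumeData.* | head -c 3
SEQ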
------- flume-conf.properties --------
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1
# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true
## MEM Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100
# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1
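Two notes on the sink block above. First, as Jeff explains further down the thread, the serializer lines cannot take effect yet: hdfs.fileType is unset, so the sink keeps writing its default SequenceFile container and ignores the serializer entirely. The missing line is:
------- the line the sink block is missing ------
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
Second, note the header naming: with the hostHeader override, the host interceptor writes the agent's hostname into a header named "hostname", which is what the serializer columns reference, while the %{host} escape in hdfs.path resolves from a header named "host". The per-host dated directories in the original message were being created before any interceptors existed, so the "host" header is evidently coming from the syslog source itself.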
On 14-03-28 3:37 PM, Jeff Lord wrote:
> Do you have the appropriate interceptors configured?
>
>
> On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez <ryan.suarez@sheridancollege.ca> wrote:
>
> RTFM indicates I need the following sink properties:
>
> ---
> hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
> hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
> hadoop-t1.sinks.hdfs1.serializer.format = CSV
> hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
> ---
>
> But I'm still not getting timestamp information. How would I get
> hostname and timestamp information in the logs?
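A note on the serializer line in the quoted properties: with the default SequenceFile output the serializer is ignored altogether, and even once the file type is switched to a data stream, the class-name form of the setting must point at the serializer's nested Builder class rather than the serializer class itself. Both spellings below should select the same serializer (the alias is the form Jeff recommends later in the thread):
------- equivalent serializer settings (pick one) ------
# built-in alias
hadoop-t1.sinks.hdfs1.serializer = header_and_text
# fully-qualified nested Builder class
hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder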
>
>
> On 14-03-26 3:02 PM, Ryan Suarez wrote:
>
> Greetings,
>
> I'm running the flume that ships with Hortonworks HDP2 to feed
> syslogs to hdfs. The problem is that the timestamp and hostname of
> the event are not logged to hdfs.
>
> ---
> flume@hadoop-t1:~$ hadoop fs -cat /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
> SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
> pam_unix(su:session): session opened for user root by
> someuser(uid=11111)
> ---
>
> How do I configure the sink to add hostname and timestamp info
> to the event?
>
> Here's my flume-conf.properties:
>
> ---
> flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
> # Name the components on this agent
> hadoop-t1.sources = syslog1
> hadoop-t1.sinks = hdfs1
> hadoop-t1.channels = mem1
>
> # Describe/configure the source
> hadoop-t1.sources.syslog1.type = syslogtcp
> hadoop-t1.sources.syslog1.host = localhost
> hadoop-t1.sources.syslog1.port = 10005
> hadoop-t1.sources.syslog1.portHeader = port
>
> ##HDFS Sink
> hadoop-t1.sinks.hdfs1.type = hdfs
> hadoop-t1.sinks.hdfs1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
> hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>
> # Use a channel which buffers events in memory
> hadoop-t1.channels.mem1.type = memory
> hadoop-t1.channels.mem1.capacity = 1000
> hadoop-t1.channels.mem1.transactionCapacity = 100
>
> # Bind the source and sink to the channel
> hadoop-t1.sources.syslog1.channels = mem1
> hadoop-t1.sinks.hdfs1.channel = mem1
> ---
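Jeff's question above refers to Flume interceptors, which stamp headers onto each event as it enters the source; this first config has none. The block Ryan adds in his later config (with the source renamed to match this one) is:
------- interceptor block added later in the thread ------
hadoop-t1.sources.syslog1.interceptors = i1 i2
hadoop-t1.sources.syslog1.interceptors.i1.type = timestamp
hadoop-t1.sources.syslog1.interceptors.i2.type = host
hadoop-t1.sources.syslog1.interceptors.i2.hostHeader = hostname
That said, the dated per-host path in the listing above shows the %Y-%m-%d and %{host} escapes were already resolving, so the syslogtcp source was evidently supplying timestamp and host headers on its own; as the rest of the thread shows, the real blocker is the sink's file format, not missing headers.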
>
> ---
> flume@hadoop-t1:~$ flume-ng version
> Flume 1.4.0.2.0.11.0-1
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
> ---
>
>
>
Re: preserve syslog header in hdfs sink
Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Oops, my bad: a typo in my config file. I incorrectly put
fileType=datastream instead of hdfs.fileType=datastream. Thanks, Jeff!
It's working for me now; I see timestamp and hostname.
regards,
Ryan
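Putting the whole thread together, the sink block that ends up working should look like the one below. This is the final posted config with the hdfs.fileType fix applied; the source, interceptor, and channel definitions are unchanged from the messages above.
------- working HDFS sink (reconstructed) ------
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
hadoop-t1.sinks.s1.serializer = header_and_text
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true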
On 14-04-02 2:21 PM, Ryan Suarez wrote:
> Ok, I've added hdfs.fileType = datastream and sink.serializer =
> header_and_text. But I'm still seeing the logs written in sequence
> format. Any ideas?
>
> [...]
>
> On 14-04-01 12:13 PM, Jeff Lord wrote:
> Well, you are writing a sequence file (the default). Is that what you want?
>> If you want text use:
>>
>> hdfs.fileType = datastream
>>
>> and for the serializer you should be able to just use:
>>
>> a1.sinks.k1.sink.serializer = header_and_text
>>
>>
>>
>> On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez <ryan.suarez@sheridancollege.ca> wrote:
>> [...]
Re: preserve syslog header in hdfs sink
Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Ok, I've added hdfs.fileType = datastream and sink.serializer =
header_and_text. But I'm still seeing the logs written in sequence
format. Any ideas?
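The starred fileType line in the config below is the one the "oops" follow-up above corrects: without the hdfs. prefix the sink never sees the key and stays on its default SequenceFile format. The corrected line reads:
------- corrected line ------
hadoop-t1.sinks.s1.hdfs.fileType = DataStream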
-----
flume@hadoop-t1:~$ flume-ng version
Flume 1.4.0.2.0.11.0-1
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
-----
root@hadoop-t1:/etc/flume/conf# cat flume-conf.properties
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1
# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.fileType = *DataStream*
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = *header_and_text*
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true
## MEM Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100
# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1
On 14-04-01 12:13 PM, Jeff Lord wrote:
> Well, you are writing a sequence file (the default). Is that what you want?
> If you want text use:
>
> hdfs.fileType = datastream
>
> and for the serializer you should be able to just use:
>
> a1.sinks.k1.sink.serializer = header_and_text
>
>
>
> On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez <ryan.suarez@sheridancollege.ca> wrote:
> [...]
Re: preserve syslog header in hdfs sink
Posted by Jeff Lord <jl...@cloudera.com>.
Well, you are writing a sequence file (the default). Is that what you want?
If you want text use:
hdfs.fileType = datastream
and for the serializer you should be able to just use:
a1.sinks.k1.sink.serializer = header_and_text
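A note on the property names, since they differ by sink type: the sink.serializer spelling belongs to sinks such as file_roll, while the HDFS sink takes serializer directly on the sink name, which is the form that ends up working in this thread; fileType likewise only takes effect under the hdfs. prefix. With Jeff's placeholder agent and sink names:
------- HDFS sink spellings (sketch) ------
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.serializer = header_and_text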
On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez <ry...@sheridancollege.ca> wrote:
> [...]