Posted to user@flume.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2014/03/26 20:02:44 UTC

preserve syslog header in hdfs sink

Greetings,

I'm running the Flume that's shipped with Hortonworks HDP2 to feed syslogs 
to hdfs.  The problem is that the timestamp and hostname of the event are 
not logged to hdfs.

---
flume@hadoop-t1:~$ hadoop fs -cat 
/opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]: 
pam_unix(su:session): session opened for user root by someuser(uid=11111)
---

How do I configure the sink to add hostname and timestamp info to the 
event?

Here's my flume-conf.properties:

---
flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
# Name the components on this agent
hadoop-t1.sources = syslog1
hadoop-t1.sinks = hdfs1
hadoop-t1.channels = mem1

# Describe/configure the source
hadoop-t1.sources.syslog1.type = syslogtcp
hadoop-t1.sources.syslog1.host = localhost
hadoop-t1.sources.syslog1.port = 10005
hadoop-t1.sources.syslog1.portHeader = port

##HDFS Sink
hadoop-t1.sinks.hdfs1.type = hdfs
hadoop-t1.sinks.hdfs1.hdfs.path = 
hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1

# Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100

# Bind the source and sink to the channel
hadoop-t1.sources.syslog1.channels = mem1
hadoop-t1.sinks.hdfs1.channel = mem1
---

---
flume@hadoop-t1:~$ flume-ng version
Flume 1.4.0.2.0.11.0-1
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
 From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
---

Re: preserve syslog header in hdfs sink

Posted by Christopher Shannon <cs...@gmail.com>.
The Regex Extractor Interceptor can get info from the event body and add it
to the event headers.
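
A sketch of that interceptor, assuming the event body still carries a
standard syslog prefix (the regex and the header names below are only
illustrative, not tested against your data):

---
hadoop-t1.sources.syslog1.interceptors = i1
hadoop-t1.sources.syslog1.interceptors.i1.type = regex_extractor
hadoop-t1.sources.syslog1.interceptors.i1.regex = ^(\\w{3}\\s+\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2})\\s(\\S+)
hadoop-t1.sources.syslog1.interceptors.i1.serializers = s1 s2
hadoop-t1.sources.syslog1.interceptors.i1.serializers.s1.name = timestamp
hadoop-t1.sources.syslog1.interceptors.i1.serializers.s2.name = hostname
---
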
On Mar 28, 2014 2:28 PM, "Ryan Suarez" <ry...@sheridancollege.ca>
wrote:

> RTFM indicates I need the following sink properties:
>
> ---
> hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.
> HeaderAndBodyTextEventSerializer
> hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
> hadoop-t1.sinks.hdfs1.serializer.format = CSV
> hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
> ---
>
> But I'm still not getting timestamp information.  How would I get hostname
> and timestamp information in the logs?
>
> On 14-03-26 3:02 PM, Ryan Suarez wrote:
>
>> Greetings,
>>
>> I'm running the Flume that's shipped with Hortonworks HDP2 to feed syslogs
>> to hdfs.  The problem is that the timestamp and hostname of the event are
>> not logged to hdfs.
>>
>> ---
>> flume@hadoop-t1:~$ hadoop fs -cat /opt/logs/hadoop-t1/2014-03-
>> 26/FlumeData.1395859766307
>> SEQ!org.apache.hadoop.io.LongWritable"org.apache.
>> hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>> pam_unix(su:session): session opened for user root by someuser(uid=11111)
>> ---
>>
>> How do I configure the sink to add hostname and timestamp info to the
>> event?
>>
>> Here's my flume-conf.properties:
>>
>> ---
>> flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>> # Name the components on this agent
>> hadoop-t1.sources = syslog1
>> hadoop-t1.sinks = hdfs1
>> hadoop-t1.channels = mem1
>>
>> # Describe/configure the source
>> hadoop-t1.sources.syslog1.type = syslogtcp
>> hadoop-t1.sources.syslog1.host = localhost
>> hadoop-t1.sources.syslog1.port = 10005
>> hadoop-t1.sources.syslog1.portHeader = port
>>
>> ##HDFS Sink
>> hadoop-t1.sinks.hdfs1.type = hdfs
>> hadoop-t1.sinks.hdfs1.hdfs.path = hdfs://hadoop-t1.mydomain.org:
>> 8020/opt/logs/%{host}/%Y-%m-%d
>> hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>>
>> # Use a channel which buffers events in memory
>> hadoop-t1.channels.mem1.type = memory
>> hadoop-t1.channels.mem1.capacity = 1000
>> hadoop-t1.channels.mem1.transactionCapacity = 100
>>
>> # Bind the source and sink to the channel
>> hadoop-t1.sources.syslog1.channels = mem1
>> hadoop-t1.sinks.hdfs1.channel = mem1
>> ---
>>
>> ---
>> flume@hadoop-t1:~$ flume-ng version
>> Flume 1.4.0.2.0.11.0-1
>> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
>> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>> ---
>>
>
>

Re: preserve syslog header in hdfs sink

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Oops, my bad.  There was a typo in my config file: I put 
fileType=datastream instead of hdfs.fileType=datastream.  Thanks, Jeff!  
It's working for me now; I see the timestamp and hostname.
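
For anyone hitting the same thing, the working sink stanza ends up looking
roughly like this (only the hdfs.fileType line changed relative to the
config quoted below; the serializer sub-properties are as posted earlier,
not independently verified):

---
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = header_and_text
hadoop-t1.sinks.s1.serializer.appendNewline = true
---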

regards,
Ryan

On 14-04-02 2:21 PM, Ryan Suarez wrote:
> Ok, I've added hdfs.fileType = datastream and sink.serializer = 
> header_and_text.  But I'm still seeing the logs written in sequence 
> format.  Any ideas?
>
> -----
> flume@hadoop-t1:~$ flume-ng version
> Flume 1.4.0.2.0.11.0-1
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>
>
> -----
> root@hadoop-t1:/etc/flume/conf# cat flume-conf.properties
> # Name the components on this agent
> hadoop-t1.sources = r1
> hadoop-t1.sinks = s1
> hadoop-t1.channels = mem1
>
> # Describe/configure the source
> hadoop-t1.sources.r1.type = syslogtcp
> hadoop-t1.sources.r1.host = localhost
> hadoop-t1.sources.r1.port = 10005
> hadoop-t1.sources.r1.portHeader = port
> hadoop-t1.sources.r1.interceptors = i1 i2
> hadoop-t1.sources.r1.interceptors.i1.type = timestamp
> hadoop-t1.sources.r1.interceptors.i2.type = host
> hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
>
> ##HDFS Sink
> hadoop-t1.sinks.s1.type = hdfs
> hadoop-t1.sinks.s1.fileType = *DataStream*
> hadoop-t1.sinks.s1.hdfs.path = 
> hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
> hadoop-t1.sinks.s1.hdfs.batchSize = 1
> hadoop-t1.sinks.s1.serializer = *header_and_text*
> hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
> hadoop-t1.sinks.s1.serializer.format = CSV
> hadoop-t1.sinks.s1.serializer.appendNewline = true
>
> ## MEM  Use a channel which buffers events in memory
> hadoop-t1.channels.mem1.type = memory
> hadoop-t1.channels.mem1.capacity = 1000
> hadoop-t1.channels.mem1.transactionCapacity = 100
>
> # Bind the source and sink to the channel
> hadoop-t1.sources.r1.channels = mem1
> hadoop-t1.sinks.s1.channel = mem1
>
> On 14-04-01 12:13 PM, Jeff Lord wrote:
>> Well, you are writing a sequence file (the default).  Is that what you want?
>> If you want text, use:
>>
>> hdfs.fileType = datastream
>>
>> and for the serializer you should be able to just use:
>>
>> a1.sinks.k1.sink.serializer = header_and_text
>>
>>
>>
>> On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez 
>> <ryan.suarez@sheridancollege.ca> wrote:
>>
>>     Thanks for the tip!  I was indeed missing the interceptors.  I've
>>     added them now, but the timestamp and hostname are still not
>>     showing up in the hdfs log.  Any advice?
>>
>>
>>     ------- sample event in HDFS ------
>>     SEQ
>>     !org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable������cc�c��I�[��ڳ\�����`���
>>     �� E � ����Tsu[28432]: pam_unix(su:session): session opened for
>>     user root by myuser(uid=31043)
>>
>>     ------ same event in syslog ------
>>     Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session):
>>     session opened for user root by myuser(uid=31043)
>>
>>     ------- flume-conf.properties --------
>>
>>     # Name the components on this agent
>>     hadoop-t1.sources = r1
>>     hadoop-t1.sinks = s1
>>
>>     hadoop-t1.channels = mem1
>>
>>     # Describe/configure the source
>>     hadoop-t1.sources.r1.type = syslogtcp
>>     hadoop-t1.sources.r1.host = localhost
>>     hadoop-t1.sources.r1.port = 10005
>>     hadoop-t1.sources.r1.portHeader = port
>>     hadoop-t1.sources.r1.interceptors = i1 i2
>>     hadoop-t1.sources.r1.interceptors.i1.type = timestamp
>>     hadoop-t1.sources.r1.interceptors.i2.type = host
>>     hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
>>
>>     ##HDFS Sink
>>     hadoop-t1.sinks.s1.type = hdfs
>>     hadoop-t1.sinks.s1.hdfs.path =
>>     hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>>     hadoop-t1.sinks.s1.hdfs.batchSize = 1
>>     hadoop-t1.sinks.s1.serializer =
>>     org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
>>     hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
>>     hadoop-t1.sinks.s1.serializer.format = CSV
>>     hadoop-t1.sinks.s1.serializer.appendNewline = true
>>
>>     ## MEM  Use a channel which buffers events in memory
>>
>>     hadoop-t1.channels.mem1.type = memory
>>     hadoop-t1.channels.mem1.capacity = 1000
>>     hadoop-t1.channels.mem1.transactionCapacity = 100
>>
>>     # Bind the source and sink to the channel
>>     hadoop-t1.sources.r1.channels = mem1
>>     hadoop-t1.sinks.s1.channel = mem1
>>
>>
>>
>>     On 14-03-28 3:37 PM, Jeff Lord wrote:
>>>     Do you have the appropriate interceptors configured?
>>>
>>>
>>>     On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez
>>>     <ryan.suarez@sheridancollege.ca> wrote:
>>>
>>>         RTFM indicates I need the following sink properties:
>>>
>>>         ---
>>>         hadoop-t1.sinks.hdfs1.serializer =
>>>         org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
>>>         hadoop-t1.sinks.hdfs1.serializer.columns = timestamp
>>>         hostname msg
>>>         hadoop-t1.sinks.hdfs1.serializer.format = CSV
>>>         hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
>>>         ---
>>>
>>>         But I'm still not getting timestamp information.  How would
>>>         I get hostname and timestamp information in the logs?
>>>
>>>
>>>         On 14-03-26 3:02 PM, Ryan Suarez wrote:
>>>
>>>             Greetings,
>>>
>>>             I'm running the Flume that's shipped with Hortonworks HDP2
>>>             to feed syslogs to hdfs.  The problem is that the timestamp
>>>             and hostname of the event are not logged to hdfs.
>>>
>>>             ---
>>>             flume@hadoop-t1:~$ hadoop fs -cat
>>>             /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
>>>             SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>>>             pam_unix(su:session): session opened for user root by
>>>             someuser(uid=11111)
>>>             ---
>>>
>>>             How do I configure the sink to add hostname and
>>>             timestamp info to the event?
>>>
>>>             Here's my flume-conf.properties:
>>>
>>>             ---
>>>             flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>>>             # Name the components on this agent
>>>             hadoop-t1.sources = syslog1
>>>             hadoop-t1.sinks = hdfs1
>>>             hadoop-t1.channels = mem1
>>>
>>>             # Describe/configure the source
>>>             hadoop-t1.sources.syslog1.type = syslogtcp
>>>             hadoop-t1.sources.syslog1.host = localhost
>>>             hadoop-t1.sources.syslog1.port = 10005
>>>             hadoop-t1.sources.syslog1.portHeader = port
>>>
>>>             ##HDFS Sink
>>>             hadoop-t1.sinks.hdfs1.type = hdfs
>>>             hadoop-t1.sinks.hdfs1.hdfs.path =
>>>             hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>>>             hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>>>
>>>             # Use a channel which buffers events in memory
>>>             hadoop-t1.channels.mem1.type = memory
>>>             hadoop-t1.channels.mem1.capacity = 1000
>>>             hadoop-t1.channels.mem1.transactionCapacity = 100
>>>
>>>             # Bind the source and sink to the channel
>>>             hadoop-t1.sources.syslog1.channels = mem1
>>>             hadoop-t1.sinks.hdfs1.channel = mem1
>>>             ---
>>>
>>>             ---
>>>             flume@hadoop-t1:~$ flume-ng version
>>>             Flume 1.4.0.2.0.11.0-1
>>>             Source code repository:
>>>             https://git-wip-us.apache.org/repos/asf/flume.git
>>>             Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>>>             Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>>>             From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>>>             ---
>>>
>>>
>>>
>>
>>
>


Re: preserve syslog header in hdfs sink

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Ok, I've added hdfs.fileType = datastream and sink.serializer = 
header_and_text.  But I'm still seeing the logs written in sequence 
format.  Any ideas?

-----
flume@hadoop-t1:~$ flume-ng version
Flume 1.4.0.2.0.11.0-1
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
 From source with checksum dea9ae30ce2c27486ae7c76ab7aba020


-----
root@hadoop-t1:/etc/flume/conf# cat flume-conf.properties
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1

# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname

##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.fileType = *DataStream*
hadoop-t1.sinks.s1.hdfs.path = 
hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = *header_and_text*
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true

## MEM  Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100

# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1

On 14-04-01 12:13 PM, Jeff Lord wrote:
> Well, you are writing a sequence file (the default).  Is that what you want?
> If you want text, use:
>
> hdfs.fileType = datastream
>
> and for the serializer you should be able to just use:
>
> a1.sinks.k1.sink.serializer = header_and_text
>
>
>
> On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez 
> <ryan.suarez@sheridancollege.ca> wrote:
>
>     Thanks for the tip!  I was indeed missing the interceptors.  I've
>     added them now, but the timestamp and hostname are still not showing
>     up in the hdfs log.  Any advice?
>
>
>     ------- sample event in HDFS ------
>     SEQ
>     !org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable������cc�c��I�[��ڳ\�����`���
>     �� E � ����Tsu[28432]: pam_unix(su:session): session opened for
>     user root by myuser(uid=31043)
>
>     ------ same event in syslog ------
>     Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session): session
>     opened for user root by myuser(uid=31043)
>
>     ------- flume-conf.properties --------
>
>     # Name the components on this agent
>     hadoop-t1.sources = r1
>     hadoop-t1.sinks = s1
>
>     hadoop-t1.channels = mem1
>
>     # Describe/configure the source
>     hadoop-t1.sources.r1.type = syslogtcp
>     hadoop-t1.sources.r1.host = localhost
>     hadoop-t1.sources.r1.port = 10005
>     hadoop-t1.sources.r1.portHeader = port
>     hadoop-t1.sources.r1.interceptors = i1 i2
>     hadoop-t1.sources.r1.interceptors.i1.type = timestamp
>     hadoop-t1.sources.r1.interceptors.i2.type = host
>     hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
>
>     ##HDFS Sink
>     hadoop-t1.sinks.s1.type = hdfs
>     hadoop-t1.sinks.s1.hdfs.path =
>     hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>     hadoop-t1.sinks.s1.hdfs.batchSize = 1
>     hadoop-t1.sinks.s1.serializer =
>     org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
>     hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
>     hadoop-t1.sinks.s1.serializer.format = CSV
>     hadoop-t1.sinks.s1.serializer.appendNewline = true
>
>     ## MEM  Use a channel which buffers events in memory
>
>     hadoop-t1.channels.mem1.type = memory
>     hadoop-t1.channels.mem1.capacity = 1000
>     hadoop-t1.channels.mem1.transactionCapacity = 100
>
>     # Bind the source and sink to the channel
>     hadoop-t1.sources.r1.channels = mem1
>     hadoop-t1.sinks.s1.channel = mem1
>
>
>
>     On 14-03-28 3:37 PM, Jeff Lord wrote:
>>     Do you have the appropriate interceptors configured?
>>
>>
>>     On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez
>>     <ryan.suarez@sheridancollege.ca> wrote:
>>
>>         RTFM indicates I need the following sink properties:
>>
>>         ---
>>         hadoop-t1.sinks.hdfs1.serializer =
>>         org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
>>         hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
>>         hadoop-t1.sinks.hdfs1.serializer.format = CSV
>>         hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
>>         ---
>>
>>         But I'm still not getting timestamp information.  How would I
>>         get hostname and timestamp information in the logs?
>>
>>
>>         On 14-03-26 3:02 PM, Ryan Suarez wrote:
>>
>>             Greetings,
>>
>>             I'm running the Flume that's shipped with Hortonworks HDP2 to
>>             feed syslogs to hdfs.  The problem is that the timestamp and
>>             hostname of the event are not logged to hdfs.
>>
>>             ---
>>             flume@hadoop-t1:~$ hadoop fs -cat
>>             /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
>>             SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>>             pam_unix(su:session): session opened for user root by
>>             someuser(uid=11111)
>>             ---
>>
>>             How do I configure the sink to add hostname and timestamp
>>             info to the event?
>>
>>             Here's my flume-conf.properties:
>>
>>             ---
>>             flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>>             # Name the components on this agent
>>             hadoop-t1.sources = syslog1
>>             hadoop-t1.sinks = hdfs1
>>             hadoop-t1.channels = mem1
>>
>>             # Describe/configure the source
>>             hadoop-t1.sources.syslog1.type = syslogtcp
>>             hadoop-t1.sources.syslog1.host = localhost
>>             hadoop-t1.sources.syslog1.port = 10005
>>             hadoop-t1.sources.syslog1.portHeader = port
>>
>>             ##HDFS Sink
>>             hadoop-t1.sinks.hdfs1.type = hdfs
>>             hadoop-t1.sinks.hdfs1.hdfs.path =
>>             hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>>             hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>>
>>             # Use a channel which buffers events in memory
>>             hadoop-t1.channels.mem1.type = memory
>>             hadoop-t1.channels.mem1.capacity = 1000
>>             hadoop-t1.channels.mem1.transactionCapacity = 100
>>
>>             # Bind the source and sink to the channel
>>             hadoop-t1.sources.syslog1.channels = mem1
>>             hadoop-t1.sinks.hdfs1.channel = mem1
>>             ---
>>
>>             ---
>>             flume@hadoop-t1:~$ flume-ng version
>>             Flume 1.4.0.2.0.11.0-1
>>             Source code repository:
>>             https://git-wip-us.apache.org/repos/asf/flume.git
>>             Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>>             Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>>             From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>>             ---
>>
>>
>>
>
>


Re: preserve syslog header in hdfs sink

Posted by Jeff Lord <jl...@cloudera.com>.
Well, you are writing a sequence file (the default).  Is that what you want?
If you want text, use:

hdfs.fileType = datastream

and for the serializer you should be able to just use:

a1.sinks.k1.sink.serializer = header_and_text
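
For reference, mapped onto the agent and sink names used elsewhere in this
thread (hadoop-t1, s1), those two lines would be roughly the following.
Note the hdfs. prefix on fileType; the config posted elsewhere in the
thread that ends up working also sets the serializer as .serializer rather
than .sink.serializer:

---
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
hadoop-t1.sinks.s1.serializer = header_and_text
---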



On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez
<ry...@sheridancollege.ca>wrote:

>  Thanks for the tip!  I was indeed missing the interceptors.  I've added
> them now, but the timestamp and hostname are still not showing up in the hdfs
> log.  Any advice?
>
>
> ------- sample event in HDFS ------
> SEQ
> !org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable������cc�c��I�[��ڳ\�����`���
> �� E � ����Tsu[28432]: pam_unix(su:session): session opened for user root
> by myuser(uid=31043)
>
> ------ same event in syslog ------
> Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session): session opened
> for user root by myuser(uid=31043)
>
> ------- flume-conf.properties --------
>
> # Name the components on this agent
> hadoop-t1.sources = r1
> hadoop-t1.sinks = s1
>
> hadoop-t1.channels = mem1
>
> # Describe/configure the source
> hadoop-t1.sources.r1.type = syslogtcp
> hadoop-t1.sources.r1.host = localhost
> hadoop-t1.sources.r1.port = 10005
> hadoop-t1.sources.r1.portHeader = port
> hadoop-t1.sources.r1.interceptors = i1 i2
> hadoop-t1.sources.r1.interceptors.i1.type = timestamp
> hadoop-t1.sources.r1.interceptors.i2.type = host
> hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
>
> ##HDFS Sink
> hadoop-t1.sinks.s1.type = hdfs
> hadoop-t1.sinks.s1.hdfs.path = hdfs://
> hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
> hadoop-t1.sinks.s1.hdfs.batchSize = 1
> hadoop-t1.sinks.s1.serializer =
> org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
> hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
> hadoop-t1.sinks.s1.serializer.format = CSV
> hadoop-t1.sinks.s1.serializer.appendNewline = true
>
> ## MEM  Use a channel which buffers events in memory
>
> hadoop-t1.channels.mem1.type = memory
> hadoop-t1.channels.mem1.capacity = 1000
> hadoop-t1.channels.mem1.transactionCapacity = 100
>
> # Bind the source and sink to the channel
> hadoop-t1.sources.r1.channels = mem1
> hadoop-t1.sinks.s1.channel = mem1
>
>
>
> On 14-03-28 3:37 PM, Jeff Lord wrote:
>
> Do you have the appropriate interceptors configured?
>
>
> On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez <
> ryan.suarez@sheridancollege.ca> wrote:
>
>> RTFM indicates I need the following sink properties:
>>
>> ---
>> hadoop-t1.sinks.hdfs1.serializer =
>> org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
>> hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
>> hadoop-t1.sinks.hdfs1.serializer.format = CSV
>> hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
>> ---
>>
>> But I'm still not getting timestamp information.  How would I get
>> hostname and timestamp information in the logs?
>>
>>
>> On 14-03-26 3:02 PM, Ryan Suarez wrote:
>>
>>> Greetings,
>>>
>>> I'm running the Flume that's shipped with Hortonworks HDP2 to feed syslogs
>>> to hdfs.  The problem is that the timestamp and hostname of the event are
>>> not logged to hdfs.
>>>
>>> ---
>>> flume@hadoop-t1:~$ hadoop fs -cat
>>> /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
>>> SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>>> pam_unix(su:session): session opened for user root by someuser(uid=11111)
>>> ---
>>>
>>> How do I configure the sink to add hostname and timestamp info to the
>>> event?
>>>
>>> Here's my flume-conf.properties:
>>>
>>> ---
>>> flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>>> # Name the components on this agent
>>> hadoop-t1.sources = syslog1
>>> hadoop-t1.sinks = hdfs1
>>> hadoop-t1.channels = mem1
>>>
>>> # Describe/configure the source
>>> hadoop-t1.sources.syslog1.type = syslogtcp
>>> hadoop-t1.sources.syslog1.host = localhost
>>> hadoop-t1.sources.syslog1.port = 10005
>>> hadoop-t1.sources.syslog1.portHeader = port
>>>
>>> ##HDFS Sink
>>> hadoop-t1.sinks.hdfs1.type = hdfs
>>> hadoop-t1.sinks.hdfs1.hdfs.path = hdfs://
>>> hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>>> hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>>>
>>> # Use a channel which buffers events in memory
>>> hadoop-t1.channels.mem1.type = memory
>>> hadoop-t1.channels.mem1.capacity = 1000
>>> hadoop-t1.channels.mem1.transactionCapacity = 100
>>>
>>> # Bind the source and sink to the channel
>>> hadoop-t1.sources.syslog1.channels = mem1
>>> hadoop-t1.sinks.hdfs1.channel = mem1
>>> ---
>>>
>>> ---
>>> flume@hadoop-t1:~$ flume-ng version
>>> Flume 1.4.0.2.0.11.0-1
>>> Source code repository:
>>> https://git-wip-us.apache.org/repos/asf/flume.git
>>> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>>> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>>> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>>> ---
>>>
>>
>>
>
>

Re: preserve syslog header in hdfs sink

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Thanks for the tip!  I was indeed missing the interceptors.  I've added 
them now, but the timestamp and hostname are still not showing up in the 
hdfs log.  Any advice?


------- sample event in HDFS ------
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??????cc?c??I?[???\?????`?????E?????Tsu[28432]: 
pam_unix(su:session): session opened for user root by myuser(uid=31043)

------ same event in syslog ------
Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session): session 
opened for user root by myuser(uid=31043)

------- flume-conf.properties --------
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1

# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname

##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.hdfs.path = 
hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = 
org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true

## MEM  Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100

# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1


On 14-03-28 3:37 PM, Jeff Lord wrote:
> Do you have the appropriate interceptors configured?
>
>
> On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez 
> <ryan.suarez@sheridancollege.ca> wrote:
>
>     RTFM indicates I need the following sink properties:
>
>     ---
>     hadoop-t1.sinks.hdfs1.serializer =
>     org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
>     hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
>     hadoop-t1.sinks.hdfs1.serializer.format = CSV
>     hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
>     ---
>
>     But I'm still not getting timestamp information.  How would I get
>     hostname and timestamp information in the logs?
>
>
>     On 14-03-26 3:02 PM, Ryan Suarez wrote:
>
>         Greetings,
>
>         I'm running the Flume that's shipped with Hortonworks HDP2 to feed
>         syslogs to hdfs.  The problem is that the timestamp and hostname of
>         the event are not logged to hdfs.
>
>         ---
>         flume@hadoop-t1:~$ hadoop fs -cat
>         /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
>         SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>         pam_unix(su:session): session opened for user root by
>         someuser(uid=11111)
>         ---
>
>         How do I configure the sink to add hostname and timestamp info
>         to the event?
>
>         Here's my flume-conf.properties:
>
>         ---
>         flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>         # Name the components on this agent
>         hadoop-t1.sources = syslog1
>         hadoop-t1.sinks = hdfs1
>         hadoop-t1.channels = mem1
>
>         # Describe/configure the source
>         hadoop-t1.sources.syslog1.type = syslogtcp
>         hadoop-t1.sources.syslog1.host = localhost
>         hadoop-t1.sources.syslog1.port = 10005
>         hadoop-t1.sources.syslog1.portHeader = port
>
>         ##HDFS Sink
>         hadoop-t1.sinks.hdfs1.type = hdfs
>         hadoop-t1.sinks.hdfs1.hdfs.path =
>         hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
>         hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>
>         # Use a channel which buffers events in memory
>         hadoop-t1.channels.mem1.type = memory
>         hadoop-t1.channels.mem1.capacity = 1000
>         hadoop-t1.channels.mem1.transactionCapacity = 100
>
>         # Bind the source and sink to the channel
>         hadoop-t1.sources.syslog1.channels = mem1
>         hadoop-t1.sinks.hdfs1.channel = mem1
>         ---
>
>         ---
>         flume@hadoop-t1:~$ flume-ng version
>         Flume 1.4.0.2.0.11.0-1
>         Source code repository:
>         https://git-wip-us.apache.org/repos/asf/flume.git
>         Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>         Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>         From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>         ---
>
>
>


Re: preserve syslog header in hdfs sink

Posted by Jeff Lord <jl...@cloudera.com>.
Do you have the appropriate interceptors configured?
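
For reference, timestamp and host interceptors attach to the syslog source
roughly like this (a sketch using the source name from the config quoted
below; the header name set by hostHeader is up to you):

---
hadoop-t1.sources.syslog1.interceptors = i1 i2
hadoop-t1.sources.syslog1.interceptors.i1.type = timestamp
hadoop-t1.sources.syslog1.interceptors.i2.type = host
hadoop-t1.sources.syslog1.interceptors.i2.hostHeader = hostname
---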


On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez <
ryan.suarez@sheridancollege.ca> wrote:

> RTFM indicates I need the following sink properties:
>
> ---
> hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.
> HeaderAndBodyTextEventSerializer
> hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
> hadoop-t1.sinks.hdfs1.serializer.format = CSV
> hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
> ---
>
> But I'm still not getting timestamp information.  How would I get hostname
> and timestamp information in the logs?
>
>
> On 14-03-26 3:02 PM, Ryan Suarez wrote:
>
>> Greetings,
>>
>> I'm running the Flume that's shipped with Hortonworks HDP2 to feed syslogs
>> to hdfs.  The problem is that the timestamp and hostname of the event are
>> not logged to hdfs.
>>
>> ---
>> flume@hadoop-t1:~$ hadoop fs -cat /opt/logs/hadoop-t1/2014-03-
>> 26/FlumeData.1395859766307
>> SEQ!org.apache.hadoop.io.LongWritable"org.apache.
>> hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
>> pam_unix(su:session): session opened for user root by someuser(uid=11111)
>> ---
>>
>> How do I configure the sink to add hostname and timestamp info to the
>> event?
>>
>> Here's my flume-conf.properties:
>>
>> ---
>> flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
>> # Name the components on this agent
>> hadoop-t1.sources = syslog1
>> hadoop-t1.sinks = hdfs1
>> hadoop-t1.channels = mem1
>>
>> # Describe/configure the source
>> hadoop-t1.sources.syslog1.type = syslogtcp
>> hadoop-t1.sources.syslog1.host = localhost
>> hadoop-t1.sources.syslog1.port = 10005
>> hadoop-t1.sources.syslog1.portHeader = port
>>
>> ##HDFS Sink
>> hadoop-t1.sinks.hdfs1.type = hdfs
>> hadoop-t1.sinks.hdfs1.hdfs.path = hdfs://hadoop-t1.mydomain.org:
>> 8020/opt/logs/%{host}/%Y-%m-%d
>> hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>>
>> # Use a channel which buffers events in memory
>> hadoop-t1.channels.mem1.type = memory
>> hadoop-t1.channels.mem1.capacity = 1000
>> hadoop-t1.channels.mem1.transactionCapacity = 100
>>
>> # Bind the source and sink to the channel
>> hadoop-t1.sources.syslog1.channels = mem1
>> hadoop-t1.sinks.hdfs1.channel = mem1
>> ---
>>
>> ---
>> flume@hadoop-t1:~$ flume-ng version
>> Flume 1.4.0.2.0.11.0-1
>> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
>> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
>> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
>> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
>> ---
>>
>
>

Re: preserve syslog header in hdfs sink

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
RTFM indicates I need the following sink properties:

---
hadoop-t1.sinks.hdfs1.serializer = 
org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
hadoop-t1.sinks.hdfs1.serializer.format = CSV
hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
---

But I'm still not getting timestamp information.  How would I get 
hostname and timestamp information in the logs?

On 14-03-26 3:02 PM, Ryan Suarez wrote:
> Greetings,
>
> I'm running the Flume that's shipped with Hortonworks HDP2 to feed syslogs 
> to hdfs.  The problem is that the timestamp and hostname of the event are 
> not logged to hdfs.
>
> ---
> flume@hadoop-t1:~$ hadoop fs -cat 
> /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
> SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]: 
> pam_unix(su:session): session opened for user root by someuser(uid=11111)
> ---
>
> How do I configure the sink to add hostname and timestamp info to the 
> event?
>
> Here's my flume-conf.properties:
>
> ---
> flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
> # Name the components on this agent
> hadoop-t1.sources = syslog1
> hadoop-t1.sinks = hdfs1
> hadoop-t1.channels = mem1
>
> # Describe/configure the source
> hadoop-t1.sources.syslog1.type = syslogtcp
> hadoop-t1.sources.syslog1.host = localhost
> hadoop-t1.sources.syslog1.port = 10005
> hadoop-t1.sources.syslog1.portHeader = port
>
> ##HDFS Sink
> hadoop-t1.sinks.hdfs1.type = hdfs
> hadoop-t1.sinks.hdfs1.hdfs.path = 
> hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
> hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
>
> # Use a channel which buffers events in memory
> hadoop-t1.channels.mem1.type = memory
> hadoop-t1.channels.mem1.capacity = 1000
> hadoop-t1.channels.mem1.transactionCapacity = 100
>
> # Bind the source and sink to the channel
> hadoop-t1.sources.syslog1.channels = mem1
> hadoop-t1.sinks.hdfs1.channel = mem1
> ---
>
> ---
> flume@hadoop-t1:~$ flume-ng version
> Flume 1.4.0.2.0.11.0-1
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
> Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
> From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
> ---