Posted to user@flume.apache.org by Sutanu Das <sd...@att.com> on 2016/02/18 02:06:18 UTC

regex_extractor NOT replacing the HDFS path variable

Hi Hari/Community,

We are trying to fill in part of the HDFS path with a header set by the regex_extractor interceptor, but apparently the variable is not getting replaced in the path of the HDFS Sink.

We want the HDFS path of the HDFS Sink to be /prod/hadoop/smallsite/flume_ingest_ale2/%{host}/%Y/%m/%d/%H..... where %{host} comes from an interceptor of type = regex_extractor with regex = .*host=(ale-\d+-\w+.attwifi.com).*

We know the regex works because we checked in Python that the source data output matches it:

>>> pattern = re.compile("host=(\w+-\d+-\w+.attwifi.com)\s.*")
>>> pattern.match(s)
<_sre.SRE_Match object at 0x7f8ca5cb4f30>
>>> s
'host=ale-1-sa.attwifi.com seq=478237182 timestamp=1455754889 op=1 topic_seq=540549 lic_info=10 topic=station sta_eth_mac=60:f8:1d:95:74:79 username=Javiers-phone role=centerwifi bssid=40:e3:d6:b0:02:52 device_type=iPhone sta_ip_address=192.168.21.14 hashed_sta_eth_mac=928ebc57036a2df7909c70ea5fce35774687835f hashed_sta_ip_address=8c76d83c5afb6aa1ca814d8902943a42a58d0a23 vlan=0 ht=0 ap_name=BoA-AP564'
>>>


Is my config incorrect, or do we need to write a custom interceptor for this?


Here is my Flume config:

multi-ale2-station.sources = source1
multi-ale2-station.channels = channel1
multi-ale2-station.sinks =  sink1

# Define the sources
multi-ale2-station.sources.source1.type = exec
multi-ale2-station.sources.source1.command =  /usr/local/bin/multi_ale2.py -f /etc/flume/ale_station_conf/m_s.cfg
multi-ale2-station.sources.source1.channels = channel1


# Define the channels
multi-ale2-station.channels.channel1.type = memory
multi-ale2-station.channels.channel1.capacity = 10000000
multi-ale2-station.channels.channel1.transactionCapacity = 10000000


# Define the interceptors
multi-ale2-station.sources.source1.interceptors = i1
multi-ale2-station.sources.source1.interceptors.i1.type = regex_extractor
multi-ale2-station.sources.source1.interceptors.i1.regex = .*host=(ale-\d+-\w+.attwifi.com).*
multi-ale2-station.sources.source1.interceptors.i1.serializers = s1
multi-ale2-station.sources.source1.interceptors.i1.serializers.type = default
multi-ale2-station.sources.source1.interceptors.i1.serializers.s1.name = host


# Define the HDFS sink
multi-ale2-station.sinks.sink1.type = hdfs
multi-ale2-station.sinks.sink1.channel = channel1
multi-ale2-station.sinks.sink1.hdfs.path = /prod/hadoop/smallsite/flume_ingest_ale2/%{host}/%Y/%m/%d/%H
multi-ale2-station.sinks.sink1.hdfs.fileType = DataStream
multi-ale2-station.sinks.sink1.hdfs.writeFormat = Text
multi-ale2-station.sinks.sink1.hdfs.filePrefix = Sutanu_regex_ALE_2_Station_topic
multi-ale2-station.sinks.sink1.hdfs.useLocalTimeStamp = true

Re: regex_extractor NOT replacing the HDFS path variable

Posted by iain wright <ia...@gmail.com>.
Awesome, glad I was able to help!

Cheers,

-- 
Iain Wright

This email message is confidential, intended only for the recipient(s)
named above and may contain information that is privileged, exempt from
disclosure under applicable law. If you are not the intended recipient, do
not disclose or disseminate the message to anyone except the intended
recipient. If you have received this message in error, or are not the named
recipient(s), please immediately notify the sender by return email, and
delete all copies of this message.


RE: regex_extractor NOT replacing the HDFS path variable

Posted by Sutanu Das <sd...@att.com>.
Hi Iain,

It is working with your regex with the extra \

Wow Iain, big thank you.

I’ll test some more stuff and report tomorrow. Thanks again Iain, huge help.

From: iain wright [mailto:iainwrig@gmail.com]
Sent: Wednesday, February 17, 2016 9:27 PM
To: user@flume.apache.org
Subject: Re: regex_extractor NOT replacing the HDFS path variable

Hi Sutanu,

This is working out as well:

multi-ale2-station.sources.source1.interceptors.i1.regex = host=(\\w+-\\d+-\\w+.attwifi.com)

When in doubt....escape i guess :p

Cheers,

--
Iain Wright


On Wed, Feb 17, 2016 at 7:01 PM, iain wright <ia...@gmail.com> wrote:
It's definitely something to do with the regex, or with how Flume/Java is using it or pulling it in from the config.

Specifically, the \w+-\d+-\w+ isn't matching when used in the regex (but it matches in regex testers).

The below works if you don't mind being less strict about the contents of host when matching:

multi-ale2-station.sources = source1
multi-ale2-station.channels = channel1
multi-ale2-station.sinks =  sink1

# Define the sources
multi-ale2-station.sources.source1.type = exec
multi-ale2-station.sources.source1.command = cat /home/iain/Desktop/flumetest/source.file
multi-ale2-station.sources.source1.channels = channel1

# Define the channels
multi-ale2-station.channels.channel1.type = memory
multi-ale2-station.channels.channel1.capacity = 10000000
multi-ale2-station.channels.channel1.transactionCapacity = 10000000

# Define the interceptors
multi-ale2-station.sources.source1.interceptors = i1
multi-ale2-station.sources.source1.interceptors.i1.type = regex_extractor
multi-ale2-station.sources.source1.interceptors.i1.regex = host=(.*.attwifi.com)

multi-ale2-station.sources.source1.interceptors.i1.serializers = s1
multi-ale2-station.sources.source1.interceptors.i1.serializers.type = default
multi-ale2-station.sources.source1.interceptors.i1.serializers.s1.name = host

# Define a logging sink
multi-ale2-station.sinks.sink1.type = logger
multi-ale2-station.sinks.sink1.channel = channel1


Log:

17 Feb 2016 18:57:47,079 INFO  [pool-3-thread-1] (org.apache.flume.source.ExecSource$ExecRunnable.run:376)  - Command [cat /home/iain/Desktop/flumetest/source.file] exited with 0
17 Feb 2016 18:57:47,081 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{} body: 61 63 74 75 61 6C 6C 79 20 72 61 6E 20 46 4C 5A actually ran FLZ }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,086 INFO  [conf-file-poller-0] (org.mortbay.log.Slf4jLog.info:67)  - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog


A step in the right direction at least. Good luck, and please let us know if you sort out what's going on with the regex!

Cheers,

--
Iain Wright


On Wed, Feb 17, 2016 at 6:18 PM, Sutanu Das <sd...@att.com> wrote:
Thanks Iain,

Here is s.out, a text file of the Python script's output.

We run Hortonworks and are on HDP 2.3; I think it is Flume 1.5.

I look forward to your testing, thanks again Iain.

From: iain wright [mailto:iainwrig@gmail.com]
Sent: Wednesday, February 17, 2016 8:06 PM
To: user@flume.apache.org
Subject: Re: regex_extractor NOT replacing the HDFS path variable

Hi Sutanu,

Bummer. It's definitely supported; we use it for writing to S3 in the exact manner you intend to.

If you want, run this to generate some data as it's presented to the source:
/usr/local/bin/multi_ale2.py -f /etc/flume/ale_station_conf/m_s.cfg >> out.txt

And throw it in a pastebin, or send me the file (please obfuscate any info you deem sensitive), and I will play with it as well.

I remember having a hurdle with this, and running a debug/logger sink until I could see it emitting the header with the event into the logs.

Safe to assume you're using latest stable version?

Best,
iain

--
Iain Wright


On Wed, Feb 17, 2016 at 5:53 PM, Sutanu Das <sd...@att.com> wrote:
Hi Iain,

Yes, events are getting written, but the regex_extractor variable is not getting substituted in the HDFS path.

I’ve tried both the hostname header and the regex you advised, yet no luck.

Is regex_extractor for the HDFS path of the Sink even supported?



18 Feb 2016 00:58:40,855 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:265) - Creating /prod/hadoop/smallsite/flume_ingest_ale2//2016/02/18/00/Sutanu_regex_ALE_2_Station_topic.1455757120803.tmp
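
The doubled slash in the path logged above is what an empty %{host} substitution would look like. Below is a minimal Python sketch of that behaviour, assuming (the thread does not confirm it) that the HDFS sink replaces a header that is missing from the event with an empty string. expand_path is a hypothetical helper written for illustration, not part of Flume, and it only handles the four time escapes used in this config.

```python
import re
from datetime import datetime

TEMPLATE = "/prod/hadoop/smallsite/flume_ingest_ale2/%{host}/%Y/%m/%d/%H"


def expand_path(template, headers, now):
    """Hypothetical emulation of header and time escapes in hdfs.path."""
    # Assumption: a header that is absent from the event expands to "".
    path = re.sub(r"%\{([^}]+)\}",
                  lambda m: headers.get(m.group(1), ""),
                  template)
    fields = {
        "Y": f"{now.year:04d}",
        "m": f"{now.month:02d}",
        "d": f"{now.day:02d}",
        "H": f"{now.hour:02d}",
    }
    return re.sub(r"%([YmdH])", lambda m: fields[m.group(1)], path)


when = datetime(2016, 2, 18, 0, 58, 40)

# No host header on the event: the path segment collapses to "//".
missing = expand_path(TEMPLATE, {}, when)
# Header set by the interceptor: the segment is filled in.
present = expand_path(TEMPLATE, {"host": "ale-1-sa.attwifi.com"}, when)

print(missing)
print(present)
```

So a "//" in the created path points at the interceptor not setting the header, rather than at the sink ignoring it.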


From: iain wright [mailto:iainwrig@gmail.com]
Sent: Wednesday, February 17, 2016 7:39 PM
To: user@flume.apache.org
Subject: Re: regex_extractor NOT replacing the HDFS path variable

Config looks sane.

Are events being written to /prod/hadoop/smallsite/flume_ingest_ale2//%Y/%m/%d/%H?

A couple of things that may be worth trying if you haven't yet:

- Try host=(ale-\d+-\w+.attwifi.com) instead of .*host=(ale-\d+-\w+.attwifi.com).*
- Try hostname or another header instead of host, since host is a header used by the host interceptor
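
On the first suggestion: the leading and trailing .* only matter if the whole line has to match. Here is a small Python sketch of the difference, assuming the interceptor searches within the event body (the way Java's Matcher.find() does) rather than requiring a full match. The sample line is shortened from the one posted earlier in the thread.

```python
import re

line = "host=ale-1-sa.attwifi.com seq=478237182 timestamp=1455754889"

anchored = re.compile(r".*host=(ale-\d+-\w+.attwifi.com).*")
bare = re.compile(r"host=(ale-\d+-\w+.attwifi.com)")

# When searching inside the line, both forms capture the same group.
anchored_hit = anchored.search(line).group(1)
bare_hit = bare.search(line).group(1)
print(anchored_hit, bare_hit)

# Only a whole-line match needs the surrounding .*
bare_full = bare.fullmatch(line)
anchored_full = anchored.fullmatch(line).group(1)
print(bare_full, anchored_full)
```

If the search assumption holds, dropping the .* changes nothing about what gets captured, so it is a simplification rather than a fix.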


--
Iain Wright



Re: regex_extractor NOT replacing the HDFS path variable

Posted by iain wright <ia...@gmail.com>.
Hi Sutanu,

This is working out as well:

multi-ale2-station.sources.source1.interceptors.i1.regex = host=(\\w+-\\d+-\\w+.attwifi.com)

When in doubt....escape i guess :p
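
A likely explanation for why doubling the backslashes helps: Flume agent configs are Java properties files, and the Java properties format treats a backslash as an escape character, silently dropping it before characters such as d or w. The sketch below imitates that step in Python. The function as_loaded_by_properties is a hypothetical, simplified stand-in (not Flume or Java code), and the sample line is shortened:

```python
import re

def as_loaded_by_properties(value: str) -> str:
    # Hypothetical, simplified stand-in for Java .properties unescaping:
    # a backslash is consumed and the character after it is kept, so a
    # doubled backslash becomes a single one and \d becomes plain d.
    # Real Java also maps \t, \n, \uXXXX and so on; this sketch ignores that.
    out, i = [], 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            out.append(value[i + 1])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)

sample = "host=ale-1-sa.attwifi.com seq=478237182 timestamp=1455754889"

# Raw strings hold the text exactly as it would be typed in the config file.
single = as_loaded_by_properties(r"host=(\w+-\d+-\w+.attwifi.com)")
doubled = as_loaded_by_properties(r"host=(\\w+-\\d+-\\w+.attwifi.com)")

print(single)    # host=(w+-d+-w+.attwifi.com)
print(doubled)   # host=(\w+-\d+-\w+.attwifi.com)
print(re.search(single, sample))            # None
print(re.search(doubled, sample).group(1))  # ale-1-sa.attwifi.com
```

With single backslashes the regex engine ends up looking for literal "w" and "d" characters, so nothing is captured and the host header is never set.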

Cheers,

-- 
Iain Wright

This email message is confidential, intended only for the recipient(s)
named above and may contain information that is privileged, exempt from
disclosure under applicable law. If you are not the intended recipient, do
not disclose or disseminate the message to anyone except the intended
recipient. If you have received this message in error, or are not the named
recipient(s), please immediately notify the sender by return email, and
delete all copies of this message.


Re: regex_extractor NOT replacing the HDFS path vaiable

Posted by iain wright <ia...@gmail.com>.
It's definitely something to do with the regex or how flume/java is using
it/pulling it in from config

Specifically the \w+-\d+-\w+ isn't matching when used in the regex (but
matches in regex testers)

The below works if you don't mind being less strict about the contents of host
when matching:


multi-ale2-station.sources = source1
multi-ale2-station.channels = channel1
multi-ale2-station.sinks =  sink1

# Define the sources
multi-ale2-station.sources.source1.type = exec
multi-ale2-station.sources.source1.command = cat /home/iain/Desktop/flumetest/source.file
multi-ale2-station.sources.source1.channels = channel1

# Define the channels
multi-ale2-station.channels.channel1.type = memory
multi-ale2-station.channels.channel1.capacity = 10000000
multi-ale2-station.channels.channel1.transactionCapacity = 10000000

# Define the interceptors
multi-ale2-station.sources.source1.interceptors = i1
multi-ale2-station.sources.source1.interceptors.i1.type = regex_extractor
multi-ale2-station.sources.source1.interceptors.i1.regex = host=(.*.attwifi.com)

multi-ale2-station.sources.source1.interceptors.i1.serializers = s1
multi-ale2-station.sources.source1.interceptors.i1.serializers.type = default
multi-ale2-station.sources.source1.interceptors.i1.serializers.s1.name = host

# Define a logging sink
multi-ale2-station.sinks.sink1.type = logger
multi-ale2-station.sinks.sink1.channel = channel1


Log:


17 Feb 2016 18:57:47,079 INFO  [pool-3-thread-1] (org.apache.flume.source.ExecSource$ExecRunnable.run:376)  - Command [cat /home/iain/Desktop/flumetest/source.file] exited with 0
17 Feb 2016 18:57:47,081 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{} body: 61 63 74 75 61 6C 6C 79 20 72 61 6E 20 46 4C 5A actually ran FLZ }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,082 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,083 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,084 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:94)  - Event: { headers:{host=ale-1-sa.attwifi.com} body: 68 6F 73 74 3D 61 6C 65 2D 31 2D 73 61 2E 61 74 host=ale-1-sa.at }
17 Feb 2016 18:57:47,086 INFO  [conf-file-poller-0] (org.mortbay.log.Slf4jLog.info:67)  - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog


A step in the right direction at least. Good luck, and please let us know if
you sort out what's going on w/ the regex!
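
One possible way to keep the match strict without any backslashes is to use character classes such as [0-9] and [.] in place of \d and \. in the pattern. The sketch below compares the loose pattern from this message with such a variant. The second input line is hypothetical (a made-up event in which the domain appears twice) and is only there to show how the greedy .* can over-capture:

```python
import re

sample = "host=ale-1-sa.attwifi.com seq=478237182 timestamp=1455754889"
# Hypothetical line in which the domain appears a second time later on.
tricky = "host=ale-1-sa.attwifi.com seq=1 referrer=portal.attwifi.com op=1"

loose = "host=(.*.attwifi.com)"
# Character classes instead of \d, \w and \. so no backslash is needed.
strict = "host=(ale-[0-9]+-[A-Za-z0-9_]+[.]attwifi[.]com)"

loose_on_sample = re.search(loose, sample).group(1)
loose_on_tricky = re.search(loose, tricky).group(1)
strict_on_tricky = re.search(strict, tricky).group(1)

print(loose_on_sample)   # ale-1-sa.attwifi.com
print(loose_on_tricky)   # ale-1-sa.attwifi.com seq=1 referrer=portal.attwifi.com
print(strict_on_tricky)  # ale-1-sa.attwifi.com
```

Whether the real events ever contain the domain twice is an assumption; if they do not, the loose pattern is good enough.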

Cheers,

-- 
Iain Wright


RE: regex_extractor NOT replacing the HDFS path vaiable

Posted by Sutanu Das <sd...@att.com>.
Thanks Iain,

Here is the s.out, which is a text file of the Python script output.

We run Hortonworks and we are on HDP 2.3 – I think it is Flume 1.5.

I look forward to your testing; thanks again, Iain.

From: iain wright [mailto:iainwrig@gmail.com]
Sent: Wednesday, February 17, 2016 8:06 PM
To: user@flume.apache.org
Subject: Re: regex_extractor NOT replacing the HDFS path vaiable

Hi Sutanu,

Bummer. It's definitely supported; we use it for writing to S3 in the exact manner you intend to.

If you want to run this to generate some data as it's presented to the source:
/usr/local/bin/multi_ale2.py -f /etc/flume/ale_station_conf/m_s.cfg >> out.txt

And throw it in a pastebin, or send me the file (please obfuscate any info you deem sensitive), I will play with it as well.

I remember having a hurdle with this, and running a debug/logger sink until I could see it emitting the header with the event into logs.

Safe to assume you're using latest stable version?

Best,
iain

--
Iain Wright


On Wed, Feb 17, 2016 at 5:53 PM, Sutanu Das <sd...@att.com> wrote:
Hi Iain,

Yes, events are getting written, but the regex_extractor variable is not getting substituted in the HDFS path.

I’ve tried both the hostname header and the regex you advised; still no luck.

Is regex_extractor for the HDFS path of the sink even supported?



18 Feb 2016 00:58:40,855 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:265) - Creating /prod/hadoop/smallsite/flume_ingest_ale2//2016/02/18/00/Sutanu_regex_ALE_2_Station_topic.1455757120803.tmp
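
The doubled slash in the path logged above is what one would expect if the %{host} placeholder is replaced by an empty string because the event carries no host header. A minimal sketch of that substitution, using a hypothetical helper expand_headers (not Flume's actual code) and assuming a missing header expands to nothing:

```python
import re

def expand_headers(path_template: str, headers: dict) -> str:
    # Hypothetical helper: replace each %{name} with the header value,
    # or with an empty string when the header is absent (assumed behaviour).
    # Time escapes such as %Y are deliberately left untouched here.
    return re.sub(r"%\{([^}]+)\}",
                  lambda m: headers.get(m.group(1), ""),
                  path_template)

template = "/prod/hadoop/smallsite/flume_ingest_ale2/%{host}/%Y/%m/%d/%H"

without_header = expand_headers(template, {})
with_header = expand_headers(template, {"host": "ale-1-sa.attwifi.com"})

print(without_header)  # /prod/hadoop/smallsite/flume_ingest_ale2//%Y/%m/%d/%H
print(with_header)     # /prod/hadoop/smallsite/flume_ingest_ale2/ale-1-sa.attwifi.com/%Y/%m/%d/%H
```

So an empty path segment suggests the interceptor never set the header, rather than the sink ignoring it.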


From: iain wright [mailto:iainwrig@gmail.com]
Sent: Wednesday, February 17, 2016 7:39 PM
To: user@flume.apache.org
Subject: Re: regex_extractor NOT replacing the HDFS path vaiable

Config looks sane,

Are events being written to /prod/hadoop/smallsite/flume_ingest_ale2//%Y/%m/%d/%H?

A couple things that may be worth trying if you haven't yet:

- Try host=(ale-\d+-\w+.attwifi.com) instead of .*host=(ale-\d+-\w+.attwifi.com).*
- Try hostname or another header instead of host, since host is a header used by the host interceptor


--
Iain Wright



Re: regex_extractor NOT replacing the HDFS path vaiable

Posted by iain wright <ia...@gmail.com>.
Hi Sutanu,

Bummer. It's definitely supported; we use it for writing to S3 in exactly the
manner you intend to.

If you want, run this to capture some data as it's presented to the
source:
/usr/local/bin/multi_ale2.py -f /etc/flume/ale_station_conf/m_s.cfg >> out.txt

Then throw it in a pastebin, or send me the file (please obfuscate any info
you deem sensitive), and I will play with it as well.

I remember hitting a hurdle with this, and running a debug/logger sink until
I could see it emitting the header with the event in the logs.

Safe to assume you're using the latest stable version?

Best,
iain

-- 
Iain Wright

This email message is confidential, intended only for the recipient(s)
named above and may contain information that is privileged, exempt from
disclosure under applicable law. If you are not the intended recipient, do
not disclose or disseminate the message to anyone except the intended
recipient. If you have received this message in error, or are not the named
recipient(s), please immediately notify the sender by return email, and
delete all copies of this message.


RE: regex_extractor NOT replacing the HDFS path vaiable

Posted by Sutanu Das <sd...@att.com>.
Hi Iain,

Yes, events are getting written, but the regex_extractor variable is not getting substituted in the HDFS path.

I've tried both the hostname header and the regex you advised, but still no luck.

Is using regex_extractor for the HDFS path of the sink even supported?



18 Feb 2016 00:58:40,855 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:265) - Creating /prod/hadoop/smallsite/flume_ingest_ale2//2016/02/18/00/Sutanu_regex_ALE_2_Station_topic.1455757120803.tmp
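
The doubled slash in that log line is consistent with the %{host} header being absent from the event when the path is built. A minimal Python sketch of the substitution, assuming (as I understand the HDFS sink's path escaping) that a missing header expands to an empty string; expand_path is a hypothetical helper for illustration, not Flume code, and it handles only the four date escapes used in this thread:

```python
import re
from datetime import datetime

TEMPLATE = "/prod/hadoop/smallsite/flume_ingest_ale2/%{host}/%Y/%m/%d/%H"


def expand_path(template, headers, now):
    """Hypothetical stand-in for the sink's path escaping."""
    # %{name} -> header value; assumption: a missing header becomes "".
    path = re.sub(r"%\{([^}]+)\}", lambda m: headers.get(m.group(1), ""), template)
    # Expand the date escapes that appear in the configured hdfs.path.
    for escape, value in (
        ("%Y", f"{now.year:04d}"),
        ("%m", f"{now.month:02d}"),
        ("%d", f"{now.day:02d}"),
        ("%H", f"{now.hour:02d}"),
    ):
        path = path.replace(escape, value)
    return path


when = datetime(2016, 2, 18, 0, 58, 40)
# No "host" header on the event: the segment collapses to an empty string.
print(expand_path(TEMPLATE, {}, when))
# Header present: the segment is filled in.
print(expand_path(TEMPLATE, {"host": "ale-1-sa.attwifi.com"}, when))
```

The first call reproduces the directory seen in the log; the second shows what the path would look like once the header is attached to the event.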




Re: regex_extractor NOT replacing the HDFS path vaiable

Posted by iain wright <ia...@gmail.com>.
Config looks sane.

Are events being written to /prod/hadoop/smallsite/flume_ingest_ale2//%Y/%m/%d/%H?

A couple of things that may be worth trying if you haven't yet:

- Try host=(ale-\d+-\w+.attwifi.com) instead of .*host=(ale-\d+-\w+.attwifi.com).*
- Try hostname or another header instead of host, since host is a header
used by the host interceptor


-- 
Iain Wright

