Posted to user@flume.apache.org by Gary Malouf <ma...@gmail.com> on 2013/03/14 22:54:38 UTC

Writing to HDFS from multiple HDFS agents (separate machines)

Hi guys,

I'm new to Flume (and HDFS, for that matter), using the version packaged
with CDH4 (1.3.0), and was wondering how others keep the file names
written by each HDFS sink distinct.

My initial thought is to create a separate sub-directory in HDFS for each
sink, though I feel the better way is to somehow prefix each file
with a unique sink id.  Are there any patterns that others are following
for this?

-Gary

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Seshu V <se...@gmail.com>.
I was able to differentiate sources by creating a separate directory per
hostname, using this config:

agent.sources.syslogsrc.interceptors = ts
agent.sources.syslogsrc.interceptors.ts.type = timestamp
agent.sinks.hdfsSink.hdfs.path = hdfs://<ip_addr>:<port>/flumetest/%{host}/%y-%m-%d

However, I have a related question.  Two different products are sending
their logs to one source, and I am collecting them via syslog.  Is there a
way to differentiate the two products' logs coming from a single source in
Flume?  I would ideally like a sub-directory at the sink, like
'/flumetest/%{host}/<product_name>/%y-%m-%d'.  How can I do this?
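
One possible approach, sketched only (the source name, port, and 'product'
header below are hypothetical, not from this thread): if each product can
be pointed at its own syslog port, a static interceptor per source can
stamp a product header that the sink path then references.

# hypothetical: one syslog source per product, each on its own port
agent.sources.productA-src.type = syslogudp
agent.sources.productA-src.host = 0.0.0.0
agent.sources.productA-src.port = 5140
agent.sources.productA-src.interceptors = prod ts
agent.sources.productA-src.interceptors.prod.type = static
agent.sources.productA-src.interceptors.prod.key = product
agent.sources.productA-src.interceptors.prod.value = productA
agent.sources.productA-src.interceptors.ts.type = timestamp

# the sink path can then reference the stamped header
agent.sinks.hdfsSink.hdfs.path = hdfs://<ip_addr>:<port>/flumetest/%{host}/%{product}/%y-%m-%d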

Thanks,
- Seshu


On Thu, Mar 14, 2013 at 5:00 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello sir,
>
>     One idea could be to create the sub-directories from the machines'
> hostnames, in case you are getting data from multiple sources.  You can
> then easily tell which data belongs to which machine.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Fri, Mar 15, 2013 at 3:24 AM, Gary Malouf <ma...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm new to Flume (and HDFS, for that matter), using the version packaged
>> with CDH4 (1.3.0), and was wondering how others keep the file names
>> written by each HDFS sink distinct.
>>
>> My initial thought is to create a separate sub-directory in HDFS for each
>> sink, though I feel the better way is to somehow prefix each file
>> with a unique sink id.  Are there any patterns that others are following
>> for this?
>>
>> -Gary
>>
>
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Mohammad Tariq <do...@gmail.com>.
Hello sir,

    One idea could be to create the sub-directories from the machines'
hostnames, in case you are getting data from multiple sources.  You can
then easily tell which data belongs to which machine.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Mar 15, 2013 at 3:24 AM, Gary Malouf <ma...@gmail.com> wrote:

> Hi guys,
>
> I'm new to Flume (and HDFS, for that matter), using the version packaged
> with CDH4 (1.3.0), and was wondering how others keep the file names
> written by each HDFS sink distinct.
>
> My initial thought is to create a separate sub-directory in HDFS for each
> sink, though I feel the better way is to somehow prefix each file
> with a unique sink id.  Are there any patterns that others are following
> for this?
>
> -Gary
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Mike Percy <mp...@apache.org>.
Hi Gary,
All the suggestions in this thread are good. Something else to consider is
that adding multiple HDFS sinks pulling from the same channel is a
recommended practice for maximizing performance (the competing-consumers
pattern). In that case, not only is it a good idea to put the data into
directories specific to the hostname of the Flume agent writing to HDFS,
you will also need to number the HDFS sink path (or filePrefix) to indicate
which HDFS sink wrote the event, in order to prevent name collisions.

Example:

# add hostname interceptor to your source as described above

# hdfs sinks...
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
# … snip ...
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events
# … etc ...
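
A slightly fuller sketch of that layout, assuming both sinks drain a single
channel named ch1 (the channel name here is a placeholder):

agent.sinks = hdfs-1 hdfs-2
agent.sinks.hdfs-1.type = hdfs
agent.sinks.hdfs-1.channel = ch1
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
agent.sinks.hdfs-2.type = hdfs
agent.sinks.hdfs-2.channel = ch1
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events

Since both sinks compete for events on the same channel, each event is
written by exactly one of them, and the /1/ and /2/ path components keep
their output files from colliding.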

Hope that helps.

Regards,
Mike

On Thu, Mar 14, 2013 at 3:34 PM, Gary Malouf <ma...@gmail.com> wrote:

> To be clear, I am referring to segregating data by Flume sink, as opposed
> to by the original source of the event.  Having said that, it sounds like
> your approach is the easiest.
>
> -Gary
>
>
> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm new to Flume (and HDFS, for that matter), using the version packaged
>> with CDH4 (1.3.0), and was wondering how others keep the file names
>> written by each HDFS sink distinct.
>>
>> My initial thought is to create a separate sub-directory in HDFS for each
>> sink, though I feel the better way is to somehow prefix each file
>> with a unique sink id.  Are there any patterns that others are following
>> for this?
>>
>> -Gary
>>
>
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
It just depends on what you want to do with the header. In the case I presented, the header is set by the agent running the HDFS sink, which seemed to align with your use case. If you need to know the originating host, just have the interceptor or the originating host set a different header; the %{} notation lets you swap in any arbitrary header for the token, as long as it exists, of course.
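
For instance, a hedged sketch (the 'dc' header name and its value are
placeholders, not from this thread): a static interceptor on the
originating agent can stamp any header you like, and the sink-side agent
can then fold it into the path or filePrefix.

# on the originating agent: stamp a custom 'dc' header on every event
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = static
agent1.sources.src1.interceptors.i1.key = dc
agent1.sources.src1.interceptors.i1.value = east-1

# on the agent running the HDFS sink: reference the header via %{}
agent2.sinks.hdfsSink.hdfs.path = /flume/%{dc}/%Y%m%d

Headers set upstream travel with the event across Avro hops, so the
sink-side agent sees whatever the originating host stamped.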

-Paul


On Mar 14, 2013, at 7:31 PM, "Gary Malouf" <ma...@gmail.com> wrote:

Paul, I interpreted the host property as identifying the host an event originates from, rather than the host of the sink that writes the event to HDFS.  Is my understanding correct?


What happens if I am using the NettyAvroRpcClient to feed events from a different server, round-robin style, to two HDFS-writing agents; should I then NOT set the host property on the client side and rely on the interceptor instead?


On Thu, Mar 14, 2013 at 6:34 PM, Gary Malouf <ma...@gmail.com> wrote:
To be clear, I am referring to segregating data by Flume sink, as opposed to by the original source of the event.  Having said that, it sounds like your approach is the easiest.

-Gary


On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:
Hi guys,

I'm new to Flume (and HDFS, for that matter), using the version packaged with CDH4 (1.3.0), and was wondering how others keep the file names written by each HDFS sink distinct.

My initial thought is to create a separate sub-directory in HDFS for each sink, though I feel the better way is to somehow prefix each file with a unique sink id.  Are there any patterns that others are following for this?

-Gary



Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Mike Percy <mp...@apache.org>.
In my experience, 3-5 HDFS sinks give optimal performance, but it depends
on whether you use a memory channel or a file channel, your overall
throughput, batch sizes, and event sizes.
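
As a rough starting point, the knobs involved might look like the sketch
below (the channel name, sink count, and batch size are illustrative, not
recommendations from this thread):

agent.channels.ch1.type = file
agent.sinks = hdfs-1 hdfs-2 hdfs-3
agent.sinks.hdfs-1.type = hdfs
agent.sinks.hdfs-1.channel = ch1
agent.sinks.hdfs-1.hdfs.batchSize = 1000
# hdfs-2 and hdfs-3 would be configured the same way, each with its own
# numbered path or filePrefix as discussed earlier in the thread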

Regards,
Mike


On Thu, Mar 14, 2013 at 7:42 PM, Gary Malouf <ma...@gmail.com> wrote:

> Thanks for the pointer, Mike.  Any thoughts on how you choose how many
> consumers per channel?  I will eventually find the optimal number via perf
> testing, but it would be good to start with a sensible default.
>
> Thanks,
>
> Gary
>
>
> On Thu, Mar 14, 2013 at 10:30 PM, Gary Malouf <ma...@gmail.com> wrote:
>
>> Paul, I interpreted the host property as identifying the host an event
>> originates from, rather than the host of the sink that writes the event
>> to HDFS.  Is my understanding correct?
>>
>>
>> What happens if I am using the NettyAvroRpcClient to feed events from a
>> different server, round-robin style, to two HDFS-writing agents; should I
>> then NOT set the host property on the client side and rely on the
>> interceptor instead?
>>
>>
>> On Thu, Mar 14, 2013 at 6:34 PM, Gary Malouf <ma...@gmail.com> wrote:
>>
>>> To be clear, I am referring to segregating data by Flume sink, as
>>> opposed to by the original source of the event.  Having said that, it
>>> sounds like your approach is the easiest.
>>>
>>> -Gary
>>>
>>>
>>> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm new to Flume (and HDFS, for that matter), using the version
>>>> packaged with CDH4 (1.3.0), and was wondering how others keep the file
>>>> names written by each HDFS sink distinct.
>>>>
>>>> My initial thought is to create a separate sub-directory in HDFS for
>>>> each sink, though I feel the better way is to somehow prefix each
>>>> file with a unique sink id.  Are there any patterns that others are
>>>> following for this?
>>>>
>>>> -Gary
>>>>
>>>
>>>
>>
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Gary Malouf <ma...@gmail.com>.
Thanks for the pointer, Mike.  Any thoughts on how you choose how many
consumers per channel?  I will eventually find the optimal number via perf
testing, but it would be good to start with a sensible default.

Thanks,

Gary


On Thu, Mar 14, 2013 at 10:30 PM, Gary Malouf <ma...@gmail.com> wrote:

> Paul, I interpreted the host property as identifying the host an event
> originates from, rather than the host of the sink that writes the event
> to HDFS.  Is my understanding correct?
>
>
> What happens if I am using the NettyAvroRpcClient to feed events from a
> different server, round-robin style, to two HDFS-writing agents; should I
> then NOT set the host property on the client side and rely on the
> interceptor instead?
>
>
>> On Thu, Mar 14, 2013 at 6:34 PM, Gary Malouf <ma...@gmail.com> wrote:
>
>> To be clear, I am referring to segregating data by Flume sink, as
>> opposed to by the original source of the event.  Having said that, it
>> sounds like your approach is the easiest.
>>
>> -Gary
>>
>>
>> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> I'm new to Flume (and HDFS, for that matter), using the version packaged
>>> with CDH4 (1.3.0), and was wondering how others keep the file names
>>> written by each HDFS sink distinct.
>>>
>>> My initial thought is to create a separate sub-directory in HDFS for
>>> each sink, though I feel the better way is to somehow prefix each
>>> file with a unique sink id.  Are there any patterns that others are
>>> following for this?
>>>
>>> -Gary
>>>
>>
>>
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Gary Malouf <ma...@gmail.com>.
Paul, I interpreted the host property as identifying the host an event
originates from, rather than the host of the sink that writes the event to
HDFS.  Is my understanding correct?


What happens if I am using the NettyAvroRpcClient to feed events from a
different server, round-robin style, to two HDFS-writing agents; should I
then NOT set the host property on the client side and rely on the
interceptor instead?


On Thu, Mar 14, 2013 at 6:34 PM, Gary Malouf <ma...@gmail.com> wrote:

> To be clear, I am referring to segregating data by Flume sink, as opposed
> to by the original source of the event.  Having said that, it sounds like
> your approach is the easiest.
>
> -Gary
>
>
> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm new to Flume (and HDFS, for that matter), using the version packaged
>> with CDH4 (1.3.0), and was wondering how others keep the file names
>> written by each HDFS sink distinct.
>>
>> My initial thought is to create a separate sub-directory in HDFS for each
>> sink, though I feel the better way is to somehow prefix each file
>> with a unique sink id.  Are there any patterns that others are following
>> for this?
>>
>> -Gary
>>
>
>

Re: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Gary Malouf <ma...@gmail.com>.
To be clear, I am referring to segregating data by Flume sink, as opposed
to by the original source of the event.  Having said that, it sounds like
your approach is the easiest.

-Gary


On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <ma...@gmail.com> wrote:

> Hi guys,
>
> I'm new to Flume (and HDFS, for that matter), using the version packaged
> with CDH4 (1.3.0), and was wondering how others keep the file names
> written by each HDFS sink distinct.
>
> My initial thought is to create a separate sub-directory in HDFS for each
> sink, though I feel the better way is to somehow prefix each file
> with a unique sink id.  Are there any patterns that others are following
> for this?
>
> -Gary
>

RE: Writing to HDFS from multiple HDFS agents (separate machines)

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
You can use a Host Interceptor on the agents running an HDFS sink, and then use %{host} in the .hdfs.filePrefix property. This isn't really documented, but it works: the docs only mention using those tokens in the path property, but they seem to be fine for filePrefix as well.

Here are some excerpts of a test config I have that does just that:

#define the interceptor on the source
staging2.sources.httpSource_stg.interceptors = iHost
staging2.sources.httpSource_stg.interceptors.iHost.type = host
staging2.sources.httpSource_stg.interceptors.iHost.useIP = false

#use the header the interceptor added in the filePrefix
staging2.sinks.hdfs_FilterLogs.type = hdfs
staging2.sinks.hdfs_FilterLogs.channel = mc_FilterLogs
staging2.sinks.hdfs_FilterLogs.hdfs.path = /flume_stg/FilterLogsJSON/%Y%m%d
staging2.sinks.hdfs_FilterLogs.hdfs.filePrefix = %{host}

Hope that helps,
Paul Chavez

________________________________
From: Gary Malouf [mailto:malouf.gary@gmail.com]
Sent: Thursday, March 14, 2013 2:55 PM
To: user
Subject: Writing to HDFS from multiple HDFS agents (separate machines)

Hi guys,

I'm new to Flume (and HDFS, for that matter), using the version packaged with CDH4 (1.3.0), and was wondering how others keep the file names written by each HDFS sink distinct.

My initial thought is to create a separate sub-directory in HDFS for each sink, though I feel the better way is to somehow prefix each file with a unique sink id.  Are there any patterns that others are following for this?

-Gary