Posted to user@flume.apache.org by Jean-Philippe Caruana <jp...@target2sell.com> on 2014/10/15 16:02:55 UTC

HDFS sink: "clever" routing

Hi,

I am new to Flume (and to HDFS), so I hope my question is not stupid.

I have a multi-tenant application (about 100 different customers for now).
I have 16 different data types.

(In production, we have approx. 15 million messages/day through our
RabbitMQ)

I want to write all my events to HDFS, separated by tenant, data type,
and date, like this:
/data/{tenant}/{data_type}/2014/10/15/file-08.csv

Is it possible with one sink definition? I don't want to duplicate
configuration, and new clients arrive every week or so.

In the documentation, I see:
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/

Is this possible?
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/

I want to write to different folders according to my incoming data.

Thanks

-- 
Jean-Philippe Caruana 
http://www.barreverte.fr


Re: HDFS sink: "clever" routing

Posted by Jean-Philippe Caruana <jp...@target2sell.com>.
On 15/10/2014 17:57, Gwen Shapira wrote:
> Yes, this is absolutely possible - but you need to make sure the flume
> event has the matching keys in the event header (tenant, type, and
> timestamp).
> Do this either using interceptors or through a custom source.

Thanks, I'll try it (maybe next week: priorities changed here).

-- 
Jean-Philippe Caruana
http://www.barreverte.fr


Re: HDFS sink: "clever" routing

Posted by Gwen Shapira <gs...@cloudera.com>.
Yes, this is absolutely possible - but you need to make sure the flume
event has the matching keys in the event header (tenant, type, and
timestamp).
Do this either using interceptors or through a custom source.
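For illustration, a minimal sketch of what that could look like (the agent,
source, and sink names are placeholders, and it assumes your application sets
"tenant" and "type" headers on every event):

agent1.sources.source1.interceptors = ts
# the timestamp interceptor adds the "timestamp" header that the %Y/%m/%d/%H escapes need
agent1.sources.source1.interceptors.ts.type = timestamp
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/data/%{tenant}/%{type}/%Y/%m/%d/%H/

Headers are referenced with the %{header_name} syntax; alternatively, setting
hdfs.useLocalTimeStamp = true on the sink uses the write time instead of a
timestamp header.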

On Wed, Oct 15, 2014 at 7:02 AM, Jean-Philippe Caruana
<jp...@target2sell.com> wrote:
> Hi,
>
> I am new to Flume (and to HDFS), so I hope my question is not stupid.
>
> I have a multi-tenant application (about 100 different customers for now).
> I have 16 different data types.
>
> (In production, we have approx. 15 million messages/day through our
> RabbitMQ)
>
> I want to write all my events to HDFS, separated by tenant, data type,
> and date, like this:
> /data/{tenant}/{data_type}/2014/10/15/file-08.csv
>
> Is it possible with one sink definition? I don't want to duplicate
> configuration, and new clients arrive every week or so.
>
> In the documentation, I see:
> agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/
>
> Is this possible?
> agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/
>
> I want to write to different folders according to my incoming data.
>
> Thanks
>
> --
> Jean-Philippe Caruana
> http://www.barreverte.fr
>

Re: HDFS sink: "clever" routing

Posted by Johny Rufus <jr...@cloudera.com>.
The completed file name will always have the epoch timestamp/counter appended
to it (this is to uniquely distinguish the rolled files).
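To illustrate (the component names below are placeholders), the configurable
prefix and suffix wrap around that generated counter rather than replacing it:

dp1.sinks.sinkSG.hdfs.filePrefix = %{basename}
dp1.sinks.sinkSG.hdfs.fileSuffix = .done
# rolled files come out roughly as <filePrefix>.<counter><fileSuffix>;
# the counter itself cannot be turned off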

Thanks,
Rufus

On Fri, May 29, 2015 at 10:46 AM, Guyle M. Taber <gu...@gmtech.net> wrote:

> Ok I figured this out by using the %{basename} placeholder.
>
> However I’m trying to figure out how to prevent the epoch suffix from
> being applied to every file as it’s written to hdfs.
>
> Example:
> 20150528133001.txt-.1432920411283
>
> How do I prevent the epoch timestamp from being appended to every file
> name?
>
>
>
>
>
> > On May 28, 2015, at 3:23 PM, <gu...@gmtech.net> wrote:
> >
> > I’m using the %{file} var to hold and preserve the file/log name as it’s
> stored in HDFS, but it seems to be recreating the entire directory
> structure from the source side.
> > How can I simply write the filename as-is into the HDFS path specified?
> >
> > dp1.sinks.sinkSG.hdfs.filePrefix = %{file}  # Just want the file name
> and not the entire path+filename.
> >
> dp1.sinks.sinkSG.hdfs.path = hdfs://hadoopnn1.company.com/flume/events/fe_event/%{host}/%y-%m-%d
>

Re: HDFS sink: "clever" routing

Posted by "Guyle M. Taber" <gu...@gmtech.net>.
Ok I figured this out by using the %{basename} placeholder.
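For anyone else who hits this, a rough sketch of the source-side settings that
make %{basename} available (this assumes a spooling directory source; the names
and paths are placeholders):

dp1.sources.spool1.type = spooldir
dp1.sources.spool1.spoolDir = /var/log/incoming
# basenameHeader puts only the file name (no path) into the "basename" header
dp1.sources.spool1.basenameHeader = true
# fileHeader would instead expose the full path as %{file}, which is why that
# variable recreated the whole directory structure in the HDFS path
dp1.sinks.sinkSG.hdfs.filePrefix = %{basename}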

However I’m trying to figure out how to prevent the epoch suffix from being applied to every file as it’s written to hdfs.

Example:
20150528133001.txt-.1432920411283

How do I prevent the epoch timestamp from being appended to every file name?





> On May 28, 2015, at 3:23 PM, <gu...@gmtech.net> wrote:
> 
> I’m using the %{file} var to hold and preserve the file/log name as it’s stored in HDFS, but it seems to be recreating the entire directory structure from the source side.
> How can I simply write the filename as-is into the HDFS path specified?
> 
> dp1.sinks.sinkSG.hdfs.filePrefix = %{file}  # Just want the file name and not the entire path+filename.
> 
> dp1.sinks.sinkSG.hdfs.path = hdfs://hadoopnn1.company.com/flume/events/fe_event/%{host}/%y-%m-%d

Re: HDFS sink: "clever" routing

Posted by Jean <la...@yahoo.fr>.
I second that, Murphy's law...
New releases or patches can break the correct event format, manual mistakes too, and so on.

> On Oct 16, 2014, at 17:54, Paul Chavez <pc...@ntent.com> wrote:
> 
> Human error is the most common reason in my experience. Whether it is a configuration error or a fault in app development, I was just relaying a method to make your Flume infrastructure more resilient. Regarding corrupted events, now that I think of it, those have always been within the event payload and we have never actually seen corrupted headers.
> 
>> On Oct 16, 2014, at 8:24 AM, "Jean-Philippe Caruana" <jp...@target2sell.com> wrote:
>> 
>> On 15/10/2014 17:57, Paul Chavez wrote:
>>> Yes, that will work fine. From experience, I can say you should definitely account for the possibility of the 'tenant' and 'data_type' headers being corrupted or missing outright.
>> 
>> How come they are missing or corrupted?
>> If my app is the only source for these events, I suppose we can code it
>> so that headers are never missing.
>> 
>> Can you elaborate?
>> 
>> Thanks
>> 
>> -- 
>> Jean-Philippe Caruana 
>> http://www.barreverte.fr
>> 

Re: HDFS sink: "clever" routing

Posted by Paul Chavez <pc...@ntent.com>.
Human error is the most common reason in my experience. Whether it is a configuration error or a fault in app development, I was just relaying a method to make your Flume infrastructure more resilient. Regarding corrupted events, now that I think of it, those have always been within the event payload and we have never actually seen corrupted headers.

> On Oct 16, 2014, at 8:24 AM, "Jean-Philippe Caruana" <jp...@target2sell.com> wrote:
> 
> On 15/10/2014 17:57, Paul Chavez wrote:
>> Yes, that will work fine. From experience, I can say you should definitely account for the possibility of the 'tenant' and 'data_type' headers being corrupted or missing outright.
> 
> How come they are missing or corrupted?
> If my app is the only source for these events, I suppose we can code it
> so that headers are never missing.
> 
> Can you elaborate?
> 
> Thanks
> 
> -- 
> Jean-Philippe Caruana 
> http://www.barreverte.fr
> 

Re: HDFS sink: "clever" routing

Posted by Jean-Philippe Caruana <jp...@target2sell.com>.
On 15/10/2014 17:57, Paul Chavez wrote:
> Yes, that will work fine. From experience, I can say you should definitely account for the possibility of the 'tenant' and 'data_type' headers being corrupted or missing outright.

How come they are missing or corrupted?
If my app is the only source for these events, I suppose we can code it
so that headers are never missing.

Can you elaborate?

Thanks

-- 
Jean-Philippe Caruana 
http://www.barreverte.fr


RE: HDFS sink: "clever" routing

Posted by Paul Chavez <pc...@ntent.com>.
Yes, that will work fine. From experience, I can say you should definitely account for the possibility of the 'tenant' and 'data_type' headers being corrupted or missing outright.

At my org we have a similar setup where we auto-bucket on a 'logSubType' header that our application adds to the initial Flume event. To keep channels from blocking if this header goes missing, we have a static interceptor that adds the value 'MissingSubType' if the header does not exist. This setup has worked well for us across dozens of separate log streams for over a year.
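In config terms, that kind of safety net might look something like this (the
agent and source names, and the exact header keys, are placeholders to
illustrate the idea):

agent1.sources.source1.interceptors = defaultTenant defaultType
agent1.sources.source1.interceptors.defaultTenant.type = static
agent1.sources.source1.interceptors.defaultTenant.key = tenant
agent1.sources.source1.interceptors.defaultTenant.value = MissingTenant
# preserveExisting = true (the default) only adds the value when the header is absent
agent1.sources.source1.interceptors.defaultTenant.preserveExisting = true
agent1.sources.source1.interceptors.defaultType.type = static
agent1.sources.source1.interceptors.defaultType.key = data_type
agent1.sources.source1.interceptors.defaultType.value = MissingType
agent1.sources.source1.interceptors.defaultType.preserveExisting = true

Events with a missing header then land under a 'MissingTenant' or 'MissingType'
directory instead of blocking the channel.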

Hope that helps,
Paul Chavez


-----Original Message-----
From: Jean-Philippe Caruana [mailto:jp@target2sell.com] 
Sent: Wednesday, October 15, 2014 7:03 AM
To: user@flume.apache.org
Subject: HDFS sink: "clever" routing

Hi,

I am new to Flume (and to HDFS), so I hope my question is not stupid.

I have a multi-tenant application (about 100 different customers for now).
I have 16 different data types.

(In production, we have approx. 15 million messages/day through our
RabbitMQ)

I want to write all my events to HDFS, separated by tenant, data type, and date, like this:
/data/{tenant}/{data_type}/2014/10/15/file-08.csv

Is it possible with one sink definition? I don't want to duplicate configuration, and new clients arrive every week or so.

In the documentation, I see:
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/

Is this possible?
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/

I want to write to different folders according to my incoming data.

Thanks

--
Jean-Philippe Caruana
http://www.barreverte.fr