Posted to user@flume.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2011/08/05 08:35:56 UTC

Re: Metadata parsing

[bcc flume-user@cloudera.org (deprecated), cc
flume-user@incubator.apache.org]

Brian,

The easiest way is to use the regex decorator to create a new attribute and
use that attribute to do output bucketing.

http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_extractors

Jon.

On Mon, Jul 25, 2011 at 5:50 PM, Brian Tran <br...@gmail.com> wrote:

> I want to do output bucketing based on the tailSrcFile metadata value
> set by the tailDir source. However, I only want part of the value for
> the destination path in HDFS.
>
> For example, I have an event with the tailSrcFile value
> "unwanted_prefix_category_name-2011-07-25.log" but only want to use
> "category_name" for output bucketing.
>
> What is the easiest way to do this?
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Metadata parsing

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Brian,

Here are some directions on how to contribute code:

https://cwiki.apache.org/confluence/display/FLUME/How+to+Contribute

They are new and in progress (new project infrastructure landed yesterday),
and likely have some bugs, so please provide feedback on them as well!

Thanks,
Jon.


On Tue, Aug 9, 2011 at 1:55 AM, Brian Tran <br...@gmail.com> wrote:

> I actually wrote an implementation last week. If no one else has already
> done it, how do I go about adding it?


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Metadata parsing

Posted by Brian Tran <br...@gmail.com>.
I actually wrote an implementation last week. If no one else has already
done it, how do I go about adding it?

On Sat, Aug 6, 2011 at 3:25 AM, Lior Harel <ha...@gmail.com> wrote:

> Sure, let's do this. I'll join the dev mailing list and see if I can help
> with the implementation.

Re: Metadata parsing

Posted by Lior Harel <ha...@gmail.com>.
Sure, let's do this. I'll join the dev mailing list and see if I can help with the implementation.

On Aug 5, 2011, at 6:34 PM, Jonathan Hsieh wrote:

> Lior, 
> 
> Ah, good point, I misspoke.  Thanks for correcting me!
> 
> Unfortunately, you are correct: Flume currently can't do this out of the box.
> 
> It seems like a reasonable addition, and a patch would be gladly accepted if someone were to implement it. If you, Brian, or anyone else is interested in building this, let's move the discussion to flume-dev@incubator.apache.org!
> 
> Thanks,
> Jon.


Re: Metadata parsing

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Lior,

Ah, good point, I misspoke.  Thanks for correcting me!

Unfortunately, you are correct: Flume currently can't do this
out of the box.

It seems like a reasonable addition, and a patch would be gladly accepted if
someone were to implement it. If you, Brian, or anyone else is interested
in building this, let's move the discussion to
flume-dev@incubator.apache.org!
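
For anyone picking this up, here is a rough sketch of the logic such an
addition would need: apply a regex to an existing metadata attribute rather
than to the event body, and store the captured group as a new attribute for
bucketing. It is plain Python, not Flume code. The function name, the
"category" attribute name, and the pattern are all hypothetical, and it
assumes attribute values are plain strings and that the unwanted prefix is
a fixed literal.

```python
import re

# Hypothetical pattern for the filename in the question: a fixed prefix,
# the category, then a -YYYY-MM-DD.log suffix.
PATTERN = re.compile(
    r"^unwanted_prefix_(?P<category>.+)-\d{4}-\d{2}-\d{2}\.log$"
)


def extract_attribute(attrs, src_key, dst_key, pattern=PATTERN, group="category"):
    """Return a copy of attrs with dst_key set from a regex match on attrs[src_key].

    If src_key is missing or the pattern does not match, the copy is returned
    without dst_key, so the caller has to handle events that lack it.
    """
    result = dict(attrs)
    value = result.get(src_key)
    if value is None:
        return result
    match = pattern.match(value)
    if match:
        result[dst_key] = match.group(group)
    return result


attrs = {"tailSrcFile": "unwanted_prefix_category_name-2011-07-25.log"}
print(extract_attribute(attrs, "tailSrcFile", "category")["category"])
```

With the example filename from the original question this prints
category_name. Events whose filename does not match are passed through
without the new attribute, so the output path would need a fallback for
them.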

Thanks,
Jon.

On Fri, Aug 5, 2011 at 1:30 AM, Lior Harel <ha...@gmail.com> wrote:

> Hi Jon,
> I'm interested in the same use case that Brian asked about, but I'm not sure I
> understand your answer. As far as I understand, the regex decorator can only
> extract data out of the event body, while the tailSrcFile attribute is part
> of the metadata. Can the regex decorator somehow operate on it?
>
>
> Lior


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Metadata parsing

Posted by Lior Harel <ha...@gmail.com>.
Hi Jon,
I'm interested in the same use case that Brian asked about, but I'm not sure I understand your answer. As far as I understand, the regex decorator can only extract data out of the event body, while the tailSrcFile attribute is part of the metadata. Can the regex decorator somehow operate on it?


Lior 

On Aug 5, 2011, at 9:35 AM, Jonathan Hsieh wrote:

> [bcc flume-user@cloudera.org (deprecated), cc flume-user@incubator.apache.org]
> 
> Brian,
> 
> The easiest way is to use the regex decorator to create a new attribute and use that attribute to do output bucketing.
> 
> http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_extractors
> 
> Jon.