Posted to user@flume.apache.org by Tzur Turkenitz <tz...@vision.bi> on 2013/02/04 16:38:04 UTC

Data Lineage

Hello All,

 

In my company we are concerned about data lineage. Big files can be split into
smaller files (block size) inside HDFS, and smaller files can be aggregated
into larger files. We want some control over data lineage and the ability to
map source files to files in HDFS. Using interceptors we can add various
header keys such as timestamp, static values, file header, etc.

 

After a file has been processed and inserted into HDFS, do those keys still
exist and remain viewable if I choose to cat the file in Hadoop? (I did cat
the files and didn't see any of the keys.) Or do the keys only exist during
processing and never get saved into the file?

 

Alternatively, is it possible to append those keys into the file using one of
Flume's built-in components?

 

I appreciate the help,

Tzur

 


Re: Data Lineage

Posted by Roshan Naik <ro...@hortonworks.com>.
You could use the timestamp and host interceptors to add the respective
pieces of info into the Flume event's headers as necessary. Thereafter a
custom serializer can write out the host and timestamp.
-roshan
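To make the interceptor half of that concrete, here is a minimal sketch of
what the source-side configuration could look like. The agent and component
names (a1, r1, k1) and the serializer class name are illustrative assumptions,
not a tested configuration:

```properties
# Hypothetical agent/component names; a sketch, not a verified config.
a1.sources.r1.interceptors = i1 i2
# The timestamp interceptor stores the event time in millis under the
# "timestamp" header; the host interceptor stores the agent host/IP
# under the "host" header.
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host

a1.sinks.k1.type = hdfs
# A custom serializer (selected via its nested Builder class) could then
# write those headers into the output; the class name here is made up.
a1.sinks.k1.serializer = com.example.LineageEventSerializer$Builder
```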

Re: Data Lineage

Posted by Tzur Turkenitz <tz...@vision.bi>.
Thank you, Connor.

From what I understand I can use a serializer to write the data in my own
format. The language in the documentation is a bit vague, so perhaps you
could help me with the following question, Connor:
                    For a scenario where I know my log files are delimited
by \t, I would like to add a column at the start of every event row which
indicates the timestamp and file name. Can this be done by a serializer?

If it's possible I'll send it to our Java devs :)
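For the devs, the core formatting step such a serializer would perform can be
sketched in plain Java. This is a hypothetical illustration (the class and
method names are made up, and it deliberately omits the Flume dependency): a
real implementation would implement org.apache.flume.serialization.EventSerializer
and pull the values from the event headers instead of taking them as arguments.

```java
// Hypothetical sketch of the per-line formatting a custom serializer
// could perform; names are illustrative only.
public class LineageFormatter {

    // Prepend timestamp and source-file columns to a tab-delimited event body.
    public static String format(long timestampMillis, String fileName, String body) {
        return timestampMillis + "\t" + fileName + "\t" + body;
    }

    // In a real Flume serializer, write(Event e) would read these values from
    // the event headers (e.g. the "timestamp" header set by the timestamp
    // interceptor, plus a file-name header if the source provides one) and
    // write format(...) followed by a newline to the sink's OutputStream.
}
```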




-- 
Regards,
Tzur Turkenitz
Vision.BI
http://www.vision.bi/

"*Facts are stubborn things, but statistics are more pliable*"
-Mark Twain

Re: Data Lineage

Posted by Connor Woodson <cw...@gmail.com>.
You will want to look at the Serializer component
<http://flume.apache.org/FlumeUserGuide.html#event-serializers>.
The default serializer is TEXT, which will only write out the body of your
event discarding all headers. You can switch to one of the other
serializers, or if none of them suit your purpose you are able to create
your own that, for instance, could write the event in JSON format thus
preserving the headers.

(Only two serializers are currently documented. You can see all of the ones
currently in Flume here:
<https://github.com/apache/flume/tree/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization>
It looks like there's only one additional one there, and it might be exactly
what you're looking for.)

If you want more detail on creating a custom serializer, or how to use one
of the existing ones, please ask.

- Connor
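The JSON idea above can be sketched without any Flume dependency: render each
event's headers and body as one JSON object per line, so the headers survive
in the HDFS file. This is an illustrative, self-contained sketch (naive
escaping, made-up class name), not Flume's actual serializer code:

```java
import java.util.Map;

// Hypothetical sketch of the per-event output a JSON-writing serializer
// could produce; a real one would wrap this in an EventSerializer.
public class JsonEventLine {

    // Minimal escaping for backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Render headers and body as a single JSON object on one line.
    public static String toJson(Map<String, String> headers, String body) {
        StringBuilder sb = new StringBuilder("{\"headers\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        sb.append("},\"body\":\"").append(escape(body)).append("\"}");
        return sb.toString();
    }
}
```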
