Posted to user@flume.apache.org by Gintautas Sulskus <gi...@gmail.com> on 2017/09/05 12:00:18 UTC

Use case for Flume

Hi,

I have a question regarding Flume suitability for a particular use case.

Task: There is a constant incoming stream of links that point to files.
Those files are to be fetched and stored in HDFS.

Desired implementation:

1. Each link to a file is stored in a Kafka topic, Q1.
2. Flume A1.source monitors Q1 for new links.
3. Upon retrieving a link from Q1, A1.source fetches the file. The file
is eventually stored in HDFS by A1.sink.

My concern here is that A1.source seems overloaded: it would have to
perform two activities: 1) periodically poll Q1 for new links to files
and 2) fetch those files.

What do you think? Is there a cleaner way to achieve this, e.g. by using an
interceptor to fetch files? Would this be appropriate?

Best,
Gintas

Re: Use case for Flume

Posted by Bessenyei Balázs Donát <be...@apache.org>.
Hi Gintas,

I can't think of a completely out-of-the-box Flume solution, but I
believe Flume does suit your needs.
The multi-agent setup is doable; you would have to either implement a
custom source (probably based on the Avro Source) or implement an
interceptor to do the downloading, as previously discussed.
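
Roughly, the custom-source route could look like the sketch below. This
is only an illustration of Flume's PollableSource contract, not a
working implementation: all names are made up, and the actual Kafka
polling, timeouts and error handling are left out. (Since the source
has to poll Q1 itself, I've sketched a pollable source rather than an
Avro-based one.)

import java.io.InputStream;
import java.net.URL;

import org.apache.commons.io.IOUtils;
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// Hypothetical skeleton: polls Q1 for a link, downloads the file and
// puts its bytes on the channel as one event.
public class LinkFetchSource extends AbstractSource
    implements Configurable, PollableSource {

  @Override
  public void configure(Context context) {
    // read Kafka and HTTP settings from the agent config here
  }

  @Override
  public Status process() throws EventDeliveryException {
    String link = pollQ1ForLink();   // placeholder for a Kafka consumer poll
    if (link == null) {
      return Status.BACKOFF;         // nothing new, let Flume back off
    }
    try (InputStream in = new URL(link).openStream()) {
      // one downloaded file becomes one Flume event
      getChannelProcessor().processEvent(
          EventBuilder.withBody(IOUtils.toByteArray(in)));
      return Status.READY;
    } catch (Exception e) {
      throw new EventDeliveryException("Failed to fetch " + link, e);
    }
  }

  @Override
  public long getBackOffSleepIncrement() { return 1000L; }

  @Override
  public long getMaxBackOffSleepInterval() { return 5000L; }

  private String pollQ1ForLink() {
    return null;  // hypothetical: wire a KafkaConsumer in here
  }
}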

If you need any further help, please let us know.


Thank you,

Donat


Re: Use case for Flume

Posted by Gintautas Sulskus <gi...@gmail.com>.
Hi,

Thanks for the quick replies, guys.

Donat, sorry, I do not have example configs. At the moment I am just
considering available solutions to tackle the problem at hand. I would
very much prefer Flume for its modular and scalable approach, and I
would like to find an elegant solution that is "native" to Flume.
I was considering the two-agent approach as well. But then, what would
the middle part look like? Which component would download the file? I
assume I would face the same problem as now.

Denes, files would be up to 5 megabytes in size. The interceptor
approach looks the most suitable in this situation.
Regarding the sink-side interceptor, wouldn't it have the same 64 MB
size limit as the source-side one?

Best,
Gintas



Re: Use case for Flume

Posted by Denes Arvay <de...@cloudera.com>.
Hi Gintas,

What is the average (or expected maximum) size of the files you'd like to
process?
In general it is not recommended to transfer large events (i.e. >64 MB
if you use the file channel, as this is a hard limit of the protobuf
implementation).
If your files fit within this limit, I'd suggest using an interceptor
to fetch the data, update the event's body, and push the event through
Flume.
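
As a minimal sketch (assuming each event body holds exactly one URL;
the class and package names are made up, and timeouts, retries and
size checks are omitted), such an interceptor could look like:

package org.example.flume;  // hypothetical package

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical data-fetcher interceptor: reads the URL from the event
// body and replaces the body with the downloaded file's bytes.
public class UrlFetchInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // no-op
  }

  @Override
  public Event intercept(Event event) {
    String link = new String(event.getBody(), StandardCharsets.UTF_8).trim();
    try (InputStream in = new URL(link).openStream()) {
      event.setBody(IOUtils.toByteArray(in));  // files must stay well under 64 MB
      return event;
    } catch (IOException e) {
      return null;  // returning null drops the event
    }
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event fetched = intercept(e);
      if (fetched != null) {
        out.add(fetched);
      }
    }
    return out;
  }

  @Override
  public void close() {
    // no-op
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new UrlFetchInterceptor();
    }

    @Override
    public void configure(Context context) {
      // read interceptor settings (e.g. connect timeout) here if needed
    }
  }
}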

In this case your setup would be:
Kafka source + data fetcher interceptor (custom code) -> file channel
(or memory channel) -> HDFS sink
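
As a rough, hypothetical agent config for that chain (agent and
component names, broker address, topic and paths are placeholders; the
interceptor class is the one sketched above):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source reading the links from topic Q1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092
a1.sources.r1.kafka.topics = Q1
a1.sources.r1.channels = c1

# the custom data-fetcher interceptor turns each link into file contents
a1.sources.r1.interceptors = fetch
a1.sources.r1.interceptors.fetch.type = org.example.flume.UrlFetchInterceptor$Builder

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/files
# write the raw bytes instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream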

If the files are larger, you could use a customised HDFS sink which
fetches the URL and stores the file in HDFS.
In this case I'd recommend a Kafka channel -> custom HDFS sink setup,
without configuring any source.
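
Hypothetically (the sink class is made up; parseAsFlumeEvent = false
assumes the links are written to Q1 by a plain Kafka producer, not by
another Flume agent):

a1.channels = c1
a1.sinks = k1

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = broker1:9092
a1.channels.c1.kafka.topic = Q1
a1.channels.c1.parseAsFlumeEvent = false

# customised HDFS sink that downloads each link and writes the file to HDFS
a1.sinks.k1.type = org.example.flume.FetchingHdfsSink
a1.sinks.k1.channel = c1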

Actually, sink-side interceptors
(https://issues.apache.org/jira/browse/FLUME-2580) would be a good
solution for your problem, but unfortunately they are not implemented
yet.

Regards,
Denes


Re: Use case for Flume

Posted by Bessenyei Balázs Donát <be...@apache.org>.
Hi Gintas,

Do you happen to have a config file we could examine?

Based on the scenario you have described, I'm thinking of a
"multi-agent flow" (
http://flume.apache.org/FlumeUserGuide.html#setting-multi-agent-flow )
with a Kafka source at the "start" of the first agent and an HDFS sink
at the "end" of the last agent. You can scale this system as you like
(i.e. from a single node to a whole cluster).
The interceptor approach also sounds possible.
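
To make the multi-agent flow concrete, here is a hypothetical
two-agent sketch (host names, ports, topic and paths are placeholders;
the downloading itself would still live in a custom source or
interceptor on agent1, as discussed):

# agent1: reads links from Kafka and forwards events over Avro
agent1.sources = kafka-src
agent1.channels = ch1
agent1.sinks = avro-snk

agent1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-src.kafka.bootstrap.servers = broker1:9092
agent1.sources.kafka-src.kafka.topics = Q1
agent1.sources.kafka-src.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.avro-snk.type = avro
agent1.sinks.avro-snk.hostname = agent2-host
agent1.sinks.avro-snk.port = 4141
agent1.sinks.avro-snk.channel = ch1

# agent2: receives events over Avro and writes them to HDFS
agent2.sources = avro-src
agent2.channels = ch2
agent2.sinks = hdfs-snk

agent2.sources.avro-src.type = avro
agent2.sources.avro-src.bind = 0.0.0.0
agent2.sources.avro-src.port = 4141
agent2.sources.avro-src.channels = ch2

agent2.channels.ch2.type = memory

agent2.sinks.hdfs-snk.type = hdfs
agent2.sinks.hdfs-snk.channel = ch2
agent2.sinks.hdfs-snk.hdfs.path = hdfs://namenode/flume/files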

Do you have any performance concerns or are you just looking for an
elegant way to implement a solution?


Thanks,

Donat

