Posted to user@flume.apache.org by Kevin Warner <ke...@gmail.com> on 2014/09/05 04:53:19 UTC

Newbie - Sink question

Hello All,
We have the following configuration:
Source->Channel->Sink

Now, the source is pointing to a folder that has lots of json files. The
channel is file based so that there is fault tolerance and the Sink is
putting CSV files on S3.

Now, there is code written in Sink that takes the JSON events and does some
MySQL database lookup and generates CSV files to be put into S3.

The question is: is the Sink the right place for this code, or should the code
run in the Channel, since the ACID guarantees are present in the Channel? Please
advise.

-Kev
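
For reference, a minimal agent configuration for the topology Kevin describes might look like the following. This is an untested sketch: the directory paths and the custom sink class name (com.example.flume.CsvS3Sink) are made up, and the custom sink is assumed to hold the MySQL-lookup/CSV logic.

```properties
agent.sources = jsonDir
agent.channels = fileCh
agent.sinks = s3Sink

# Spooling Directory Source watching the folder of JSON files
agent.sources.jsonDir.type = spooldir
agent.sources.jsonDir.spoolDir = /data/flume/json-in
agent.sources.jsonDir.channels = fileCh

# File Channel for fault tolerance
agent.channels.fileCh.type = file
agent.channels.fileCh.checkpointDir = /data/flume/checkpoint
agent.channels.fileCh.dataDirs = /data/flume/data

# Hypothetical custom sink doing the MySQL lookup and writing CSV to S3
agent.sinks.s3Sink.type = com.example.flume.CsvS3Sink
agent.sinks.s3Sink.channel = fileCh
```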

Re: Newbie - Sink question

Posted by Sharninder <sh...@gmail.com>.
Yes, the sink seems like the right place to put the CSV-to-S3 code. Don't mess
with the channel code unless you know what you're doing. That said, since
you're doing DB lookups, I'd imagine those could slow down the whole channel,
depending on the source data rate. What I'd suggest is that you take a look
at how interceptors work and/or maybe take a look at the morphlines SDK (
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
).

Keep the source only for reading files and the sink only for writing files.
Everything else goes in the interceptor/morphline.
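
For example, a morphline interceptor can be wired onto the source roughly like this (a sketch using the standard MorphlineInterceptor; the source name, file path, and morphline id are placeholders):

```properties
agent.sources.jsonDir.interceptors = morph
agent.sources.jsonDir.interceptors.morph.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.jsonDir.interceptors.morph.morphlineFile = /etc/flume/conf/morphline.conf
agent.sources.jsonDir.interceptors.morph.morphlineId = morphline1
```

Note that the morphline interceptor runs in the source's path, so heavy transforms there will slow ingestion.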

--
Sharninder



On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <ke...@gmail.com>
wrote:

> Hello All,
> We have the following configuration:
> Source->Channel->Sink
>
> Now, the source is pointing to a folder that has lots of json files. The
> channel is file based so that there is fault tolerance and the Sink is
> putting CSV files on S3.
>
> Now, there is code written in Sink that takes the JSON events and does
> some MySQL database lookup and generates CSV files to be put into S3.
>
> The question is: is the Sink the right place for this code, or should the code
> run in the Channel, since the ACID guarantees are present in the Channel? Please
> advise.
>
> -Kev
>
>

Re: Newbie - Sink question

Posted by Kevin Warner <ke...@gmail.com>.
Guys,
I am getting push back from one of the engineers who works for a friend of
mine. Can you please take a look at his reply below and let me know what do
you guys think:

-----------------------------------------------------------------------------------------------------------------------
I really have nothing against Morphline, as it seems to be driving Apache
Flume in the right direction, but I still stand by my point that Morphline,
in its current stage of maturity, can't be used in our case.

I don't know if you have noticed that a Flume Interceptor runs inside the Flume
Source process, according to this sentence from the Morphline Interceptor's
limited documentation:
"Currently, there is a restriction in that the morphline of an interceptor
must not generate more than one output record for each input event. This
interceptor is not intended for heavy duty ETL processing - if you need
this consider moving ETL processing from the Flume Source to a Flume Sink,
e.g. to a MorphlineSolrSink."

Given that, they obviously intended the Flume Sink to be the heavy lifter, as an
implementation in the interceptor will slow the Flume Source down.
Also, there is only one Flume Sink implementation of Morphline, intended to
pass data to Solr (see this
<https://issues.apache.org/jira/browse/FLUME-2340>).

Of course, we could create our own Morphline Sink, as there is some
documentation on using the Morphline libraries in Java code.
-------------------------------------------------------------------------------------------------------------------------

Please advise.

Thanks.



On Thu, Sep 4, 2014 at 11:08 PM, Ashish <pa...@gmail.com> wrote:

> I would recommend using an Interceptor for this, and possibly a modified
> Flume topology. If the JSON files have large numbers of rows, or there is a
> very high number of files, go for a Collection tier and use another level of
> agents that use interceptors for the DB lookup and CSV generation. Something like
>
> Collection Agents -> Transformation Agents (writing to S3 Sinks)
>
> You can scale out the Transformation/Collection layer agents based on the
> traffic volume
>
> thanks
>
>
>
>
> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <ke...@gmail.com>
> wrote:
>
>> Hello All,
>> We have the following configuration:
>> Source->Channel->Sink
>>
>> Now, the source is pointing to a folder that has lots of json files. The
>> channel is file based so that there is fault tolerance and the Sink is
>> putting CSV files on S3.
>>
>> Now, there is code written in Sink that takes the JSON events and does
>> some MySQL database lookup and generates CSV files to be put into S3.
>>
>> The question is: is the Sink the right place for this code, or should the
>> code run in the Channel, since the ACID guarantees are present in the Channel?
>> Please advise.
>>
>> -Kev
>>
>>
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: Newbie - Sink question

Posted by Ashish <pa...@gmail.com>.
On Sat, Sep 6, 2014 at 4:42 AM, Kevin Warner <ke...@gmail.com>
wrote:

> Thanks Andrew, Ashish and Sharninder for your responses.
>
> I have a large number of JSON files, about 2 KB each, on Tomcat
> servers. We are using rsync to get the files from the Tomcat servers to the
> EC2 compute instances. Let's say we have 4 Tomcat servers; do we need 4
> EC2 machines with Flume on them?
>

Nope. If you can install Flume on the Tomcat servers, Flume will transfer the
files for you. Use the Spooling Directory Source and ensure that it points to a
location where only completed files are present.
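
For example (a sketch; the agent name and path are made up), the Spooling Directory Source can be told to skip files that are still being written, such as rsync temp files:

```properties
a1.sources = sp
a1.sources.sp.type = spooldir
a1.sources.sp.spoolDir = /var/lib/tomcat/json-complete
# Ignore in-progress files (e.g. rsync temp files ending in .tmp)
a1.sources.sp.ignorePattern = ^.*\.tmp$
```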


>
> On each Flume machine, we have a folder that rsync's with the Tomcat
> server folder. The source of the Flume then points to the input folder and
> after processing (we are planning to use Morphlines) the output is written
> as CSV files and uploaded to S3.
>
> Can anyone send me some examples of a tiered Flume architecture? By
> collection agents, do you mean a set of machines where each machine is
> getting data from multiple Tomcat servers? And after the Collection
> layer, is there a set of machines with a 1-1 relationship between the
> machines in the Collection tier and a Transformation tier that has
> Flume instances with Morphlines, which then write the CSV output to S3?
> Also, does it support HA, etc.?
>
> Please advise.
>

You can have a 2-tier topology. The 1st tier collects data from the Tomcat
servers, or from the location specified. It then sends the data to the next set
of Flume agents, which do the CSV transformation and write to S3.


Tomcat Server --- Flume Agent(s) ---> Flume Agent (Layer 2) --> S3

For the 1st layer you would need 4 Flume agents, assuming Flume is running on
the Tomcat servers. These agents pick up files from the server and send them to
the Tier 2 agents, which do the translation and write to S3.
You may need 2 or 4 Tier 2 agents based on the load or HA requirements.
Alternatively, if the Tier 1 agents can handle the load on their own, you may
leave out the Layer 2 agents.

Please do some benchmarking before choosing the topology.

This book would come in handy:
http://shop.oreilly.com/product/0636920030348.do. I got it 2 days ago :)
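
For the HA side of this, one option is a sink group on each Tier 1 agent that spreads (or fails over) traffic across the Tier 2 agents. A sketch, with made-up host names and ports:

```properties
tier1.sinks = avro1 avro2
tier1.sinks.avro1.type = avro
tier1.sinks.avro1.hostname = transform-1.example.com
tier1.sinks.avro1.port = 4545
tier1.sinks.avro1.channel = fileCh
tier1.sinks.avro2.type = avro
tier1.sinks.avro2.hostname = transform-2.example.com
tier1.sinks.avro2.port = 4545
tier1.sinks.avro2.channel = fileCh

# Balance across Tier 2 agents; use processor.type = failover for active/standby
tier1.sinkgroups = g1
tier1.sinkgroups.g1.sinks = avro1 avro2
tier1.sinkgroups.g1.processor.type = load_balance
```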


>
> Thanks.
>
> On Thu, Sep 4, 2014 at 11:08 PM, Ashish <pa...@gmail.com> wrote:
>
>> I would recommend using an Interceptor for this, and possibly a modified
>> Flume topology. If the JSON files have large numbers of rows, or there is a
>> very high number of files, go for a Collection tier and use another level of
>> agents that use interceptors for the DB lookup and CSV generation. Something like
>>
>> Collection Agents -> Transformation Agents (writing to S3 Sinks)
>>
>> You can scale out the Transformation/Collection layer agents based on the
>> traffic volume
>>
>> thanks
>>
>>
>>
>>
>> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <ke...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>> We have the following configuration:
>>> Source->Channel->Sink
>>>
>>> Now, the source is pointing to a folder that has lots of json files. The
>>> channel is file based so that there is fault tolerance and the Sink is
>>> putting CSV files on S3.
>>>
>>> Now, there is code written in Sink that takes the JSON events and does
>>> some MySQL database lookup and generates CSV files to be put into S3.
>>>
>>> The question is: is the Sink the right place for this code, or should the
>>> code run in the Channel, since the ACID guarantees are present in the
>>> Channel? Please advise.
>>>
>>> -Kev
>>>
>>>
>>
>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: Newbie - Sink question

Posted by Kevin Warner <ke...@gmail.com>.
Thanks Andrew, Ashish and Sharninder for your responses.

I have a large number of JSON files, about 2 KB each, on Tomcat
servers. We are using rsync to get the files from the Tomcat servers to the
EC2 compute instances. Let's say we have 4 Tomcat servers; do we need 4
EC2 machines with Flume on them?

On each Flume machine, we have a folder that rsync's with the Tomcat server
folder. The source of the Flume then points to the input folder and after
processing (we are planning to use Morphlines) the output is written as CSV
files and uploaded to S3.

Can anyone send me some examples of a tiered Flume architecture? By
collection agents, do you mean a set of machines where each machine is
getting data from multiple Tomcat servers? And after the Collection
layer, is there a set of machines with a 1-1 relationship between the
machines in the Collection tier and a Transformation tier that has
Flume instances with Morphlines, which then write the CSV output to S3?
Also, does it support HA, etc.?

Please advise.

Thanks.

On Thu, Sep 4, 2014 at 11:08 PM, Ashish <pa...@gmail.com> wrote:

> I would recommend using an Interceptor for this, and possibly a modified
> Flume topology. If the JSON files have large numbers of rows, or there is a
> very high number of files, go for a Collection tier and use another level of
> agents that use interceptors for the DB lookup and CSV generation. Something like
>
> Collection Agents -> Transformation Agents (writing to S3 Sinks)
>
> You can scale out the Transformation/Collection layer agents based on the
> traffic volume
>
> thanks
>
>
>
>
> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <ke...@gmail.com>
> wrote:
>
>> Hello All,
>> We have the following configuration:
>> Source->Channel->Sink
>>
>> Now, the source is pointing to a folder that has lots of json files. The
>> channel is file based so that there is fault tolerance and the Sink is
>> putting CSV files on S3.
>>
>> Now, there is code written in Sink that takes the JSON events and does
>> some MySQL database lookup and generates CSV files to be put into S3.
>>
>> The question is: is the Sink the right place for this code, or should the
>> code run in the Channel, since the ACID guarantees are present in the Channel?
>> Please advise.
>>
>> -Kev
>>
>>
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: Newbie - Sink question

Posted by Ashish <pa...@gmail.com>.
I would recommend using an Interceptor for this, and possibly a modified
Flume topology. If the JSON files have large numbers of rows, or there is a
very high number of files, go for a Collection tier and use another level of
agents that use interceptors for the DB lookup and CSV generation. Something like

Collection Agents -> Transformation Agents (writing to S3 Sinks)

You can scale out the Transformation/Collection layer agents based on the
traffic volume
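
Concretely, the hand-off between the two tiers is usually done with Avro sinks and sources. A sketch (host name, port, and component names are made up):

```properties
# Tier 1 (collection agent): spooldir source -> file channel -> avro sink
collector.sinks.toTier2.type = avro
collector.sinks.toTier2.hostname = transformer.example.com
collector.sinks.toTier2.port = 4545

# Tier 2 (transformation agent): avro source -> interceptor -> S3 sink
transformer.sources.fromTier1.type = avro
transformer.sources.fromTier1.bind = 0.0.0.0
transformer.sources.fromTier1.port = 4545
```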

thanks




On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <ke...@gmail.com>
wrote:

> Hello All,
> We have the following configuration:
> Source->Channel->Sink
>
> Now, the source is pointing to a folder that has lots of json files. The
> channel is file based so that there is fault tolerance and the Sink is
> putting CSV files on S3.
>
> Now, there is code written in Sink that takes the JSON events and does
> some MySQL database lookup and generates CSV files to be put into S3.
>
> The question is: is the Sink the right place for this code, or should the code
> run in the Channel, since the ACID guarantees are present in the Channel?
> Please advise.
>
> -Kev
>
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: Newbie - Sink question

Posted by Andrew Ehrlich <an...@aehrlich.com>.
What about adding in the data from MySQL as a small batch job after Flume sinks to S3? You could then delete the raw data that Flume wrote. I would worry that the database connection would be relatively slow and unreliable, and might slow down Flume's throughput.

Andrew

On Sep 4, 2014, at 7:53 PM, Kevin Warner <ke...@gmail.com> wrote:

> Hello All,
> We have the following configuration:
> Source->Channel->Sink
> 
> Now, the source is pointing to a folder that has lots of json files. The channel is file based so that there is fault tolerance and the Sink is putting CSV files on S3.
> 
> Now, there is code written in Sink that takes the JSON events and does some MySQL database lookup and generates CSV files to be put into S3. 
> 
> The question is: is the Sink the right place for this code, or should the code run in the Channel, since the ACID guarantees are present in the Channel? Please advise.
> 
> -Kev
>