Posted to user@flume.apache.org by Flavio Pompermaier <po...@okkam.it> on 2013/07/19 00:37:03 UTC

Fwd: Flume workflow design

Hi to all,
I'm new to Flume but I'm very excited about it!
I'd like to use it to gather some data, process the received messages, and
then index them into Solr.
Any suggestions on how to do that with Flume?
I've already tested an Avro source that sends data to HBase,
but my use case requires those messages to be saved in HBase but also
processed and then indexed in Solr (obviously I also need to convert the
object structure first).
I think the first part is quite simple (I just use two sinks: one that
stores in HBase and another that forwards to another Avro instance), right?
If messages are sent during a map/reduce job, is the Avro source the best
option for sending the documents to index to my sink (i.e. the first part of
the flow, which up to now I have simulated with an Avro source)?
Best,
Flavio
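The two-sink fan-out described above (one Avro source replicated to an HBase sink and a downstream Avro sink) can be sketched as a Flume agent configuration; the agent/component names, table, host, and ports below are hypothetical:

```properties
# Hypothetical agent "a1": one Avro source replicated into two channels,
# one drained by an HBase sink, the other forwarded to a downstream Avro agent.
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# replicating selector copies every event to both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = hbase
a1.sinks.k1.table = events
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = downstream.example.com
a1.sinks.k2.port = 41415
a1.sinks.k2.channel = c2
```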

Re: Are the interceptors expected to be thread-safe?

Posted by Hari Shreedharan <hs...@cloudera.com>.
Interceptors should be thread-safe, since sources can run multiple threads (like the Avro Source); if an interceptor is not thread-safe in that case, you will hit unexpected behavior.


Thanks,
Hari


On Tuesday, August 27, 2013 at 7:59 AM, Israel Ekpo wrote:

> Zoraida,
> 
> You can take a look at the source code for the following interceptors to see how they are implemented.
> 
> http://flume.apache.org/releases/content/1.4.0/apidocs/org/apache/flume/interceptor/package-summary.html 
> 
> They do have member fields (attributes) and they are mostly private.
> 
> http://flume.apache.org/source.html
> 
> You can also grab the source code from git and then navigate to the folders/package that contains these classes.
> 
> $ git clone http://git-wip-us.apache.org/repos/asf/flume.git flume
> 
> Author and Instructor for the Upcoming Book and Lecture Series 
> Massive Log Data Aggregation, Processing, Searching and Visualization with Open Source Software
> http://massivelogdata.com
> 
> 
> 
> On 26 August 2013 09:25, ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es (mailto:zoraida@tid.es)> wrote:
> > Hi all,
> > 
> > can I create an interceptor with member attributes?
> > 
> > Thanks.
> > 


Re: Are the interceptors expected to be thread-safe?

Posted by Israel Ekpo <is...@aicer.org>.
Zoraida,

You can take a look at the source code for the following interceptors to
see how they are implemented.

http://flume.apache.org/releases/content/1.4.0/apidocs/org/apache/flume/interceptor/package-summary.html

They do have member fields (attributes) and they are mostly private.

http://flume.apache.org/source.html

You can also grab the source code from git and then navigate to the
folders/package that contains these classes.

$ git clone http://git-wip-us.apache.org/repos/asf/flume.git flume


Author and Instructor for the Upcoming Book and Lecture Series
Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software
http://massivelogdata.com


On 26 August 2013 09:25, ZORAIDA HIDALGO SANCHEZ <zo...@tid.es> wrote:

> Hi all,
>
> can I create an interceptor with member attributes?
>
> Thanks.
>
>

Are the interceptors expected to be thread-safe?

Posted by ZORAIDA HIDALGO SANCHEZ <zo...@tid.es>.
Hi all,

can I create an interceptor with member attributes?

Thanks.

On 19/07/13 19:22, "Wolfgang Hoschek" <wh...@cloudera.com> wrote:

>Perhaps a MR job that writes directly into HBase (without going through
>Flume) might be more efficient. For examples see
>http://hbase.apache.org/book/mapreduce.example.html
>
>Wolfgang.
>
>On Jul 19, 2013, at 1:13 AM, Flavio Pompermaier wrote:
>
>> Thank you for the reply Wolfgang, I was just looking at the great use
>>case presented by Ari Flink of Cisco and in fact those technologies sound
>>great!
>> The problem is that in my use case there will be an initial MapReduce
>>job that will parse some text, perform some analysis and send the
>>results of those analyses to my HBaseSink.
>> Only once it has finished (not streaming!) do I have to start processing
>>the data stored in that HBase table "newer than some date contained in
>>this end-message" (I thus need a way to trigger the start of such
>>processing), which requires invoking an external REST service and storing
>>data in another output table. Here too, only once finished do I have to
>>reduce all that information and put it into Solr.
>>
>> So I think the main problem is how to avoid streaming and instead
>>trigger MapReduce jobs. Is there a way to do that with Flume?
>>
>> Best,
>> Flavio
>>
>> On Fri, Jul 19, 2013 at 12:51 AM, Wolfgang Hoschek
>><wh...@cloudera.com> wrote:
>> Take a look at these options:
>>
>> - HBase Sinks (send data into HBase):
>>
>>         http://flume.apache.org/FlumeUserGuide.html#hbasesinks
>>
>> - Apache Flume Morphline Solr Sink (for heavy duty ETL processing and
>>ingestion into Solr):
>>
>>         http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
>>
>> - Apache Flume MorphlineInterceptor (for light-weight event annotations
>>and routing):
>>
>>
>>http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>>
>> - For MapReduce jobs it is typically more straightforward and efficient
>>to send data directly to destinations, i.e. without going through Flume.
>>For example using the MapReduceIndexerTool when going from HDFS into
>>Solr:
>>
>>         https://github.com/cloudera/search/tree/master/search-mr
>>
>> Wolfgang.
>>
>> On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:
>>
>> > Hi to all,
>> >
>> > I'm new to Flume but I'm very excited about it!
>> > I'd like to use it to gather some data, process the received messages
>>and then index them into Solr.
>> > Any suggestions on how to do that with Flume?
>> > I've already tested an Avro source that sends data to HBase,
>> > but my use case requires those messages to be saved in HBase but also
>>processed and then indexed in Solr (obviously I also need to convert the
>>object structure first).
>> > I think the first part is quite simple (I just use two sinks: one that
>>stores in HBase and another that forwards to another Avro instance),
>>right?
>> > If messages are sent during a map/reduce job, is the Avro source the
>>best option for sending the documents to index to my sink (i.e. the first
>>part of the flow, which up to now I have simulated with an Avro source)?
>> > Best,
>> > Flavio
>> >
>> >
>> >
>>
>>
>>
>>
>



Re: Flume workflow design

Posted by Wolfgang Hoschek <wh...@cloudera.com>.
Perhaps a MR job that writes directly into HBase (without going through Flume) might be more efficient. For examples see http://hbase.apache.org/book/mapreduce.example.html

Wolfgang.

On Jul 19, 2013, at 1:13 AM, Flavio Pompermaier wrote:

> Thank you for the reply Wolfgang, I was just looking at the great use case presented by Ari Flink of Cisco and in fact those technologies sound great!
> The problem is that in my use case there will be an initial MapReduce job that will parse some text, perform some analysis and send the results of those analyses to my HBaseSink.
> Only once it has finished (not streaming!) do I have to start processing the data stored in that HBase table "newer than some date contained in this end-message" (I thus need a way to trigger the start of such processing),
> which requires invoking an external REST service and storing data in another output table. Here too, only once finished do I have to reduce all that information and put it into Solr.
> 
> So I think the main problem is how to avoid streaming and instead trigger MapReduce jobs. Is there a way to do that with Flume?
> 
> Best,
> Flavio
> 
> On Fri, Jul 19, 2013 at 12:51 AM, Wolfgang Hoschek <wh...@cloudera.com> wrote:
> Take a look at these options:
> 
> - HBase Sinks (send data into HBase):
> 
>         http://flume.apache.org/FlumeUserGuide.html#hbasesinks
> 
> - Apache Flume Morphline Solr Sink (for heavy duty ETL processing and ingestion into Solr):
> 
>         http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
> 
> - Apache Flume MorphlineInterceptor (for light-weight event annotations and routing):
> 
>         http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> 
> - For MapReduce jobs it is typically more straightforward and efficient to send data directly to destinations, i.e. without going through Flume. For example using the MapReduceIndexerTool when going from HDFS into Solr:
> 
>         https://github.com/cloudera/search/tree/master/search-mr
> 
> Wolfgang.
> 
> On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:
> 
> > Hi to all,
> >
> > I'm new to Flume but I'm very excited about it!
> > I'd like to use it to gather some data, process the received messages and then index them into Solr.
> > Any suggestions on how to do that with Flume?
> > I've already tested an Avro source that sends data to HBase,
> > but my use case requires those messages to be saved in HBase but also processed and then indexed in Solr (obviously I also need to convert the object structure first).
> > I think the first part is quite simple (I just use two sinks: one that stores in HBase and another that forwards to another Avro instance), right?
> > If messages are sent during a map/reduce job, is the Avro source the best option for sending the documents to index to my sink (i.e. the first part of the flow, which up to now I have simulated with an Avro source)?
> > Best,
> > Flavio
> >
> >
> >
> 
> 
> 
> 


Re: Flume workflow design

Posted by Flavio Pompermaier <po...@okkam.it>.
Thank you for the reply Wolfgang, I was just looking at the great use case
presented by Ari Flink of Cisco and in fact those technologies sound great!
The problem is that in my use case there will be an initial MapReduce job
that will parse some text, perform some analysis and send the results of
those analyses to my HBaseSink.
Only once it has finished (not streaming!) do I have to start processing the
data stored in that HBase table "newer than some date contained in this
end-message" (I thus need a way to trigger the start of such processing),
which requires invoking an external REST service and storing data in
another output table. Here too, only once finished do I have to reduce all
that information and put it into Solr.

So I think the main problem is how to avoid streaming and instead trigger
MapReduce jobs. Is there a way to do that with Flume?

Best,
Flavio

On Fri, Jul 19, 2013 at 12:51 AM, Wolfgang Hoschek <wh...@cloudera.com> wrote:

> Take a look at these options:
>
> - HBase Sinks (send data into HBase):
>
>         http://flume.apache.org/FlumeUserGuide.html#hbasesinks
>
> - Apache Flume Morphline Solr Sink (for heavy duty ETL processing and
> ingestion into Solr):
>
>         http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
>
> - Apache Flume MorphlineInterceptor (for light-weight event annotations
> and routing):
>
>         http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>
> - For MapReduce jobs it is typically more straightforward and efficient to
> send data directly to destinations, i.e. without going through Flume. For
> example using the MapReduceIndexerTool when going from HDFS into Solr:
>
>         https://github.com/cloudera/search/tree/master/search-mr
>
> Wolfgang.
>
> On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:
>
> > Hi to all,
> >
> > I'm new to Flume but I'm very excited about it!
> > I'd like to use it to gather some data, process the received messages
> and then index them into Solr.
> > Any suggestions on how to do that with Flume?
> > I've already tested an Avro source that sends data to HBase,
> > but my use case requires those messages to be saved in HBase but also
> processed and then indexed in Solr (obviously I also need to convert the
> object structure first).
> > I think the first part is quite simple (I just use two sinks: one that
> stores in HBase and another that forwards to another Avro instance),
> right?
> > If messages are sent during a map/reduce job, is the Avro source the
> best option for sending the documents to index to my sink (i.e. the first
> part of the flow, which up to now I have simulated with an Avro source)?
> > Best,
> > Flavio
> >
> >
> >
>

Re: Flume workflow design

Posted by Wolfgang Hoschek <wh...@cloudera.com>.
Take a look at these options:

- HBase Sinks (send data into HBase):

	http://flume.apache.org/FlumeUserGuide.html#hbasesinks

- Apache Flume Morphline Solr Sink (for heavy duty ETL processing and ingestion into Solr): 

	http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

- Apache Flume MorphlineInterceptor (for light-weight event annotations and routing): 

	http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor

- For MapReduce jobs it is typically more straightforward and efficient to send data directly to destinations, i.e. without going through Flume. For example using the MapReduceIndexerTool when going from HDFS into Solr: 

	https://github.com/cloudera/search/tree/master/search-mr
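For the Morphline Solr Sink option, a minimal morphline file might look like the following sketch (command names follow the Flume 1.4 / morphlines docs; the collection, zkHost, and field paths are placeholders):

```
# Sketch: parse each event body as JSON and load the record into Solr.
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      # parse each event body as a JSON record
      { readJson {} }
      # lift selected JSON fields into top-level record fields
      { extractJsonPaths { flatten : false, paths : { id : /id, text : /text } } }
      # send the record to Solr via the locator above
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```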

Wolfgang.

On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:

> Hi to all,
> 
> I'm new to Flume but I'm very excited about it!
> I'd like to use it to gather some data, process the received messages and then index them into Solr.
> Any suggestions on how to do that with Flume?
> I've already tested an Avro source that sends data to HBase,
> but my use case requires those messages to be saved in HBase but also processed and then indexed in Solr (obviously I also need to convert the object structure first).
> I think the first part is quite simple (I just use two sinks: one that stores in HBase and another that forwards to another Avro instance), right?
> If messages are sent during a map/reduce job, is the Avro source the best option for sending the documents to index to my sink (i.e. the first part of the flow, which up to now I have simulated with an Avro source)?
> Best,
> Flavio
> 
> 
>