Posted to user@flume.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2014/08/18 18:53:20 UTC

Flow in Flume, could it be done better?

Hi,

I have built a flow with Flume and I don't know whether this is the right
way to do it or there is something better. I am spooling a directory and
need that data in three different paths in HDFS with different formats, so
I have created two interceptors.

Source (spooling) + replicating selector + Interceptor1 --> C1 and C2
C1 --> Sink1 to HDFS Path1 (it's like a historic copy)
C2 --> Sink2 (Avro) --> Avro Source + multiplexing selector + Interceptor2 --> C3 and C4
C3 --> Sink3 to HDFS Path2
C4 --> Sink4 to HDFS Path3

Interceptor1 doesn't do much with the data; it basically saves them as
they are, so as to keep a history of the original data.

Interceptor2 processes the data and sets the header that the multiplexing
selector uses to redirect events to Sink3 or Sink4. But this interceptor
changes the original data.

I tried to do the whole process without replicating data, but I could not.
Now it seems like too many steps just because I want to store the original
data in HDFS as a historic copy.
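
For reference, a minimal sketch of that topology as Flume properties files
might look like the following. The agent names, ports, HDFS paths, and the
interceptor builder classes are assumptions, not the real configuration:

# agent 1: spool a directory and replicate to two channels
a1.sources = src1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/flume/in
a1.sources.src1.channels = c1 c2
a1.sources.src1.selector.type = replicating
a1.sources.src1.interceptors = i1
a1.sources.src1.interceptors.i1.type = com.example.Interceptor1$Builder

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Sink1: the historic copy in HDFS Path1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/path1

# Sink2: Avro hop to agent 2
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = 127.0.0.1
a1.sinks.k2.port = 4141

# agent 2: Avro source; Interceptor2 sets the routing header
# and the multiplexing selector picks C3 or C4
a2.sources = src2
a2.channels = c3 c4
a2.sinks = k3 k4

a2.sources.src2.type = avro
a2.sources.src2.bind = 127.0.0.1
a2.sources.src2.port = 4141
a2.sources.src2.channels = c3 c4
a2.sources.src2.interceptors = i2
a2.sources.src2.interceptors.i2.type = com.example.Interceptor2$Builder
a2.sources.src2.selector.type = multiplexing
a2.sources.src2.selector.header = route
a2.sources.src2.selector.mapping.path2 = c3
a2.sources.src2.selector.mapping.path3 = c4
a2.sources.src2.selector.default = c3

a2.channels.c3.type = memory
a2.channels.c4.type = memory

a2.sinks.k3.type = hdfs
a2.sinks.k3.channel = c3
a2.sinks.k3.hdfs.path = hdfs://namenode/path2

a2.sinks.k4.type = hdfs
a2.sinks.k4.channel = c4
a2.sinks.k4.hdfs.path = hdfs://namenode/path3

Since interceptors attach to the source and run before the channel
selector, Interceptor1's output goes to both C1 and C2; the historic copy
in Path1 only stays raw because Interceptor1 leaves the events essentially
untouched.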

Re: Flow in Flume, could it be done better?

Posted by terrey shih <te...@gmail.com>.
something like this:

agent1 src -> replicate -> channel 1 -> sink 1 (raw event sink)
                        -> channel 2 -> sink 2 -> agent2 src -> multiplexer -> sink 3
                                                                            -> sink 4

In fact, I tried not having agent 2 and connecting sink 2 directly to
source 2, but I was not able to, due to an RPCClient exception.

I am just going to try having 2 agents.
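
For what it's worth, a hedged sketch of that hop (host and port are made
up): the Avro sink has to point at the host/port where the receiving Avro
source is already listening, whether that source lives in a second agent
or in the same one.

# sending side (agent 1)
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = 127.0.0.1
a1.sinks.k2.port = 4141

# receiving side (agent 2, or another source in the same agent)
a2.sources.src2.type = avro
a2.sources.src2.bind = 127.0.0.1
a2.sources.src2.port = 4141
a2.sources.src2.channels = c3 c4

If nothing is listening on that port when the sink starts delivering, the
sink's RPC client cannot connect, which could explain an RPCClient
exception.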

terrey



Re: Flow in Flume, could it be done better?

Posted by Guillermo Ortiz <ko...@gmail.com>.
Would it be possible to attach the interceptors to the channels? I didn't
find anything about it in the documentation, so I guess not.

I guess another possibility is to execute the interceptors in the sink,
which, if I'm right, means implementing custom sinks. Is that possible?
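
For what it's worth, interceptors in the Flume configuration only hang off
sources; there is no channel- or sink-level interceptors key, so a sketch
like the following (names are made up) is the only place they can go:

# the only place an interceptor can be configured
a1.sources.src1.interceptors = i1
a1.sources.src1.interceptors.i1.type = com.example.Interceptor1$Builder

# it runs before the channel selector, so every channel
# attached to src1 receives the intercepted events
a1.sources.src1.channels = c1 c2

Transforming in the sink is possible, but it means writing a custom sink
(or a custom event serializer for the HDFS sink) rather than setting a
config key.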



Re: Flow in Flume, could it be done better?

Posted by Guillermo Ortiz <ko...@gmail.com>.
Yeah, I think that's what I'm doing.
How about:

agent1 src --> replicate + interceptor1
    --> channel1 --> sink1 (hdfs raw data)
    --> channel2 --> sink2 (avro) --> agent2 src (avro) --> multiplexing + interceptor2 --> sink3
                                                                                        --> sink4

Could it be possible to apply interceptor1 just to channel1? I know that
interceptors apply at the source level. Interceptor1 doesn't modify the
data too much, so I could feed channel2 with those small transformations,
but ideally I'd rather not. So, if I want to do that, it looks like I'd
have to create another level with more channels, etc. Something like this:

agent1 src --> replicate
    --> channel1 --> *sink1 (avro) --> src1 (avro) + interceptor1 --> channel --> sink1 (hdfs raw data)*
    --> channel2 --> sink2 (avro) --> agent2 src (avro) --> multiplexing + interceptor2 --> sink3
                                                                                        --> sink4

The point is that after sink4 my flow continues with another structure
similar to all of the above, so that means 8 channels in total. I don't
know if it's possible to simplify this.
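
A hedged sketch of that starred extra level, with invented names and port:
the raw branch loops through an Avro sink/source pair so that interceptor1
only touches this branch. Note terrey hit an RPCClient exception wiring a
loop like this inside one agent, so it may have to live in a second agent.

# sink1 (avro): drains channel1 and loops back over Avro
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4242

# src1 (avro) + interceptor1, feeding only the raw branch
a1.sources.loop.type = avro
a1.sources.loop.bind = 127.0.0.1
a1.sources.loop.port = 4242
a1.sources.loop.channels = craw
a1.sources.loop.interceptors = i1
a1.sources.loop.interceptors.i1.type = com.example.Interceptor1$Builder

a1.channels.craw.type = memory

# sink1 (hdfs raw data)
a1.sinks.kraw.type = hdfs
a1.sinks.kraw.channel = craw
a1.sinks.kraw.hdfs.path = hdfs://namenode/path1

# (loop, craw and kraw also have to be added to the agent's
# a1.sources / a1.channels / a1.sinks lists)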



Re: Flow in Flume, could it be done better?

Posted by terrey shih <te...@gmail.com>.
Well, I am actually doing similar things to you. I also need to feed the
data to different sinks: one takes just the raw data, and the others are
HBase sinks fed through the multiplexer.


agent1 src -> replicate -> channel 1 -> sink 1 (raw event sink)
                        -> channel 2 -> sink 2 -> agent2 src -> multiplexer
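
A rough sketch of that multiplexing half with HBase sinks, assuming a
hypothetical route header and invented table names:

a2.sources = av
a2.channels = ch1 ch2
a2.sinks = hb1 hb2

a2.sources.av.type = avro
a2.sources.av.bind = 0.0.0.0
a2.sources.av.port = 4141
a2.sources.av.channels = ch1 ch2
a2.sources.av.selector.type = multiplexing
a2.sources.av.selector.header = route
a2.sources.av.selector.mapping.users = ch1
a2.sources.av.selector.mapping.orders = ch2

a2.channels.ch1.type = memory
a2.channels.ch2.type = memory

a2.sinks.hb1.type = hbase
a2.sinks.hb1.channel = ch1
a2.sinks.hb1.table = users_events
a2.sinks.hb1.columnFamily = d

a2.sinks.hb2.type = hbase
a2.sinks.hb2.channel = ch2
a2.sinks.hb2.table = orders_events
a2.sinks.hb2.columnFamily = d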





Re: Flow in Flume, could it be done better?

Posted by Guillermo Ortiz <ko...@gmail.com>.
In my test, everything is in the same VM. Later, I'll have another flow
which just spools or tails a file and sends the data through Avro to
another source in my system.

Do I really need that replicating step? I feel I have too many channels,
and that means too many resources and too much configuration.



Re: Flow in Flume, could it be done better?

Posted by terrey shih <te...@gmail.com>.
Hi,

Are your two sources, the spooling one and the Avro one (fed from sink 2),
in two different JVMs/machines?

thx

