Posted to user@chukwa.apache.org by Oded Rosen <od...@legolas-media.com> on 2010/03/17 18:00:18 UTC

Hbase over Chukwa demux

 I work with a hadoop cluster with tons of new data each day.
The data is flowing into hadoop from outside servers, using chukwa.

Chukwa has a tool called demux, a builtin mapred job.
Chukwa users may write their own map & reduce classes for this demux, with
the only limitation being that the input and output types must be Chukwa
records - so I cannot use HBase's TableMap or TableReduce.
To write data to HBase during this mapred job, I can only use table.put and
table.commit, which operate on a single HBase row at a time (don't they?).
This raises serious latency issues, as writing thousands of records to HBase
this way every 5 minutes is not effective and really s-l-o-w.
Even if I move the HBase writing from the map phase to the reduce phase, the
same rows still need to be updated, so moving the .put to the reducer does
not seem likely to change anything.
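The per-row put/commit cost is exactly what client-side write buffering is meant to absorb. Below is a minimal, self-contained sketch of that batching pattern; `RowSink` and `BufferedRowWriter` are hypothetical names standing in for what the HBase 0.20-era client does with `table.setAutoFlush(false)` and `table.flushCommits()`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-row sink such as HTable; the real HBase
// client exposes the same buffering via setAutoFlush(false)/flushCommits().
interface RowSink {
    void putAll(List<String> rows); // one round trip for the whole batch
}

class BufferedRowWriter {
    private final RowSink sink;
    private final int bufferSize;
    private final List<String> buffer = new ArrayList<>();

    BufferedRowWriter(RowSink sink, int bufferSize) {
        this.sink = sink;
        this.bufferSize = bufferSize;
    }

    void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferSize) {
            flush();
        }
    }

    void flush() {
        if (!buffer.isEmpty()) {
            sink.putAll(new ArrayList<>(buffer)); // one RPC instead of N
            buffer.clear();
        }
    }
}
```

With a buffer of a few thousand puts, thousands of records every 5 minutes become a handful of round trips instead of one per row.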

I would like to write straight to HBase from the Chukwa demuxer, rather than
run another job that reads the Chukwa output and writes it to HBase.
The target is to have this data as fast as I can in hbase.

Is there a way to write efficiently to HBase without TableReduce? Have I got
something wrong?
Is there anyone using Chukwa who has managed to do this?


Thanks in advance for any kind of help,
-- 
Oded

Re: Hbase over Chukwa demux

Posted by Jerome Boulon <jb...@netflix.com>.
Or Demux could be the right place to do it...
If you need to parse/format your data, then Demux is the right place to do
that, and with my new Demux class (to be published) you can parse/format
your data and then use any output format that best matches your needs; in my
case, the Demux output is a Hive SeqFile.

On the other hand, if you don't need to parse the data and you just want to
store it without any modification, then having an HbaseWriter like the one
Ari mentions could be another option.

/Jerome.

BTW, I will give a short presentation tonight at the Facebook/Hive user group
on how we are using Honu (Chukwa-Streaming) and Hive to compute stats and
metrics at Netflix.

On 3/18/10 11:49 AM, "Ariel Rabkin" <as...@gmail.com> wrote:

> Hrm.
> 
> Demux might not be the right place in the processing pipeline to
> attack your problem.  The Chukwa collector supports pluggable writers,
> and you could think about having data pushed directly from collectors
> to HBase.  Data shows up at the collector in variable-length Chunks,
> so you'd have to parse 'em and figure out how to map them into your
> particular table schema.
> 
> --Ari
> 
> On Wed, Mar 17, 2010 at 10:00 AM, Oded Rosen <od...@legolas-media.com> wrote:
>> I work with a hadoop cluster with tons of new data each day.
>> The data is flowing into hadoop from outside servers, using chukwa.
>> Chukwa has a tool called demux, a builtin mapred job.
>> Chukwa users may write their own map & reduce classes for this demux, with
>> the only limitation that the input & output types are chukwa records - I
>> cannot use HBase's TableMap, TableReduce.
>> In order to write data to hbase during this mapred job, I can only use the
>> table.put & table.commit, which work on one hbase raw only (aren't they?).
>> This raised serious latency issues, as writing thousands of records to hbase
>> this way every 5 minutes is not effective and really s-l-o-w.
>> Even if I'll move the hbase writing from the map phase to the reduce phase,
>> the same rows should be updated, so moving the ".put" to the reducer seems
>> does not suppose to change anything.
>> I would like to write straight to hbase from the chukwa demuxer, and not to
>> have another job that reads the chukwa output and write it to hbase.
>> The target is to have this data as fast as I can in hbase.
>> Is there a way to write effectively to hbase without TableReduce? Have I got
>> something wrong?
>> is there someone using Chukwa that managed to do this thing?
>> 
>> 
>> Thanks in advance for any kind of help,
>> --
>> Oded
>> 
> 
> 


Re: Hbase over Chukwa demux

Posted by Ariel Rabkin <as...@gmail.com>.
Hrm.

Demux might not be the right place in the processing pipeline to
attack your problem.  The Chukwa collector supports pluggable writers,
and you could think about having data pushed directly from collectors
to HBase.  Data shows up at the collector in variable-length Chunks,
so you'd have to parse 'em and figure out how to map them into your
particular table schema.
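A minimal sketch of that parsing step, with illustrative types only (not Chukwa's real Chunk or pluggable-writer API): each chunk's byte payload is split into lines and mapped to row keys under an assumed source-plus-timestamp key scheme:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of what a custom writer would do with a chunk:
// the payload arrives as opaque bytes; we split it into tab-separated
// log lines and derive an HBase-style row key for each.
class ChunkParser {
    /** Map each line of a chunk's payload to rowKey -> value. */
    static Map<String, String> toRows(byte[] payload, String source) {
        Map<String, String> rows = new LinkedHashMap<>();
        String text = new String(payload, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            // Assumed key scheme: source + "/" + first field (a timestamp)
            String[] fields = line.split("\t", 2);
            String rowKey = source + "/" + fields[0];
            rows.put(rowKey, fields.length > 1 ? fields[1] : "");
        }
        return rows;
    }
}
```

The key scheme is the part you would actually have to design against your table layout; everything else is plumbing.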

--Ari

On Wed, Mar 17, 2010 at 10:00 AM, Oded Rosen <od...@legolas-media.com> wrote:
> I work with a hadoop cluster with tons of new data each day.
> The data is flowing into hadoop from outside servers, using chukwa.
> Chukwa has a tool called demux, a builtin mapred job.
> Chukwa users may write their own map & reduce classes for this demux, with
> the only limitation that the input & output types are chukwa records - I
> cannot use HBase's TableMap, TableReduce.
> In order to write data to hbase during this mapred job, I can only use the
> table.put & table.commit, which work on one hbase raw only (aren't they?).
> This raised serious latency issues, as writing thousands of records to hbase
> this way every 5 minutes is not effective and really s-l-o-w.
> Even if I'll move the hbase writing from the map phase to the reduce phase,
> the same rows should be updated, so moving the ".put" to the reducer seems
> does not suppose to change anything.
> I would like to write straight to hbase from the chukwa demuxer, and not to
> have another job that reads the chukwa output and write it to hbase.
> The target is to have this data as fast as I can in hbase.
> Is there a way to write effectively to hbase without TableReduce? Have I got
> something wrong?
> is there someone using Chukwa that managed to do this thing?
>
>
> Thanks in advance for any kind of help,
> --
> Oded
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: Hbase over Chukwa demux

Posted by Oded Rosen <od...@legolas-media.com>.
Well, the best solution for my case would be a demux process that outputs
both a Chukwa record (a regular text/sequence file would be even better) AND
writes to HBase (so a multiple-output format would be great for me).
Also, the HBase writer should query the current data in HBase (read the
same row it is about to update) to use as a reference for the update.

If those two things work, I'll be a happy man.
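A tiny sketch of that second requirement, the read-modify-write update, using a map-backed stand-in for the table (with real HBase this would be a Get followed by a Put, or incrementColumnValue for pure counters); `MergingWriter` is a hypothetical name:

```java
import java.util.HashMap;
import java.util.Map;

// Read-modify-write sketch: before writing, read the row's current value
// and merge the new value into it (summing, as for a counter).
class MergingWriter {
    private final Map<String, Long> table = new HashMap<>();

    /** Merge newValue into the row's current value (sum, in this sketch). */
    void upsert(String rowKey, long newValue) {
        table.merge(rowKey, newValue, Long::sum);
    }

    Long get(String rowKey) {
        return table.get(rowKey);
    }
}
```

Note that a read-before-every-write doubles the round trips, so it combines badly with unbatched puts; a server-side increment avoids the read entirely when the merge is a plain sum.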

On Wed, Mar 17, 2010 at 7:12 PM, Jerome Boulon <jb...@netflix.com> wrote:

>  Hi,
> I have a new Demux that use something similar to MultipleOutputFormat and
> one of my output is an Hive SeqFile (directly from Demux).
> So I guess that it should not be difficult to get a specific OutputFormat
> for Hbase.
> Do you have any special requirement other than being able to output to
> HBase?
>
> /Jerome.
>
>
> On 3/17/10 10:00 AM, "Oded Rosen" <od...@legolas-media.com> wrote:
>
> I work with a hadoop cluster with tons of new data each day.
> The data is flowing into hadoop from outside servers, using chukwa.
>
> Chukwa has a tool called demux, a builtin mapred job.
> Chukwa users may write their own map & reduce classes for this demux, with
> the only limitation that the input & output types are chukwa records - I
> cannot use HBase's TableMap, TableReduce.
> In order to write data to hbase during this mapred job, I can only use the
> table.put & table.commit, which work on one hbase raw only (aren't they?).
> This raised serious latency issues, as writing thousands of records to
> hbase this way every 5 minutes is not effective and really s-l-o-w.
> Even if I'll move the hbase writing from the map phase to the reduce phase,
> the same rows should be updated, so moving the ".put" to the reducer seems
> does not suppose to change anything.
>
> I would like to write straight to hbase from the chukwa demuxer, and not to
> have another job that reads the chukwa output and write it to hbase.
> The target is to have this data as fast as I can in hbase.
>
> Is there a way to write effectively to hbase without TableReduce? Have I
> got something wrong?
> is there someone using Chukwa that managed to do this thing?
>
>
> Thanks in advance for any kind of help,
>
>


-- 
Oded


Re: Hbase over Chukwa demux

Posted by Jerome Boulon <jb...@netflix.com>.
Hi,
I have a new Demux that uses something similar to MultipleOutputFormat, and
one of my outputs is a Hive SeqFile (directly from Demux).
So I guess it should not be difficult to create a specific OutputFormat for
HBase.
Do you have any special requirements other than being able to output to
HBase?

/Jerome.

On 3/17/10 10:00 AM, "Oded Rosen" <od...@legolas-media.com> wrote:

> I work with a hadoop cluster with tons of new data each day.
> The data is flowing into hadoop from outside servers, using chukwa.
> 
> Chukwa has a tool called demux, a builtin mapred job.
> Chukwa users may write their own map & reduce classes for this demux, with the
> only limitation that the input & output types are chukwa records - I cannot
> use HBase's TableMap, TableReduce.
> In order to write data to hbase during this mapred job, I can only use the
> table.put & table.commit, which work on one hbase raw only (aren't they?).
> This raised serious latency issues, as writing thousands of records to hbase
> this way every 5 minutes is not effective and really s-l-o-w.
> Even if I'll move the hbase writing from the map phase to the reduce phase,
> the same rows should be updated, so moving the ".put" to the reducer seems
> does not suppose to change anything.
> 
> I would like to write straight to hbase from the chukwa demuxer, and not to
> have another job that reads the chukwa output and write it to hbase.
> The target is to have this data as fast as I can in hbase.
> 
> Is there a way to write effectively to hbase without TableReduce? Have I got
> something wrong?
> is there someone using Chukwa that managed to do this thing?
> 
> 
> Thanks in advance for any kind of help,


Re: Hbase over Chukwa demux

Posted by Eric Yang <ey...@yahoo-inc.com>.
Hi Oded,

The current Chukwa Demux uses one reducer per record type for output.  It
depends on your data model, but it may be worthwhile to look into running
multiple reducers per record type if your data has a lot of records for a
single type.  I think the conf.setNumReduceTasks call is in
org.apache.hadoop.chukwa.extraction.demux.Demux.java.  You can set more if
you don't use ChukwaRecord after demux.
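For illustration only (the value 8 is arbitrary): with the old mapred API the reducer count can also be raised through the standard Hadoop job property, which is what setNumReduceTasks writes under the hood:

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```

Either form only helps once the output no longer depends on the one-reducer-per-record-type layout.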

The current demux needs some major updates, and patches are welcome.  :)

Regards,
Eric

On 3/17/10 10:00 AM, "Oded Rosen" <od...@legolas-media.com> wrote:

> I work with a hadoop cluster with tons of new data each day.
> The data is flowing into hadoop from outside servers, using chukwa.
> 
> Chukwa has a tool called demux, a builtin mapred job.
> Chukwa users may write their own map & reduce classes for this demux, with the
> only limitation that the input & output types are chukwa records - I cannot
> use HBase's TableMap, TableReduce.
> In order to write data to hbase during this mapred job, I can only use the
> table.put & table.commit, which work on one hbase raw only (aren't they?).
> This raised serious latency issues, as writing thousands of records to hbase
> this way every 5 minutes is not effective and really s-l-o-w.
> Even if I'll move the hbase writing from the map phase to the reduce phase,
> the same rows should be updated, so moving the ".put" to the reducer seems
> does not suppose to change anything.
> 
> I would like to write straight to hbase from the chukwa demuxer, and not to
> have another job that reads the chukwa output and write it to hbase.
> The target is to have this data as fast as I can in hbase.
> 
> Is there a way to write effectively to hbase without TableReduce? Have I got
> something wrong?
> is there someone using Chukwa that managed to do this thing?
> 
> 
> Thanks in advance for any kind of help,