Posted to user@hadoop.apache.org by Pavan Sudheendra <pa...@gmail.com> on 2013/09/21 08:32:55 UTC

How to best decide mapper output/reducer input for a huge string?

I need to improve my MR job, which uses HBase as both source and sink..

Basically, I'm reading data from 3 HBase tables in the mapper, writing them
out as one huge string for the reducer to do some computation and dump into
an HBase table..

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId
televisionID timestamp

I'm interested in sorting on the basis of the HouseHoldId value, so I'm
using this technique. I'm not interested in the V part of the pair, so I'm
more or less ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

My MR job takes 22 hours to complete, which is not desirable at all. I'm
supposed to optimize it somehow to run a lot faster..

scan.setCaching(750);
scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob(
                                       Table1,                 // input HBase table name
                                       scan,
                                       AnalyzeMapper.class,    // mapper
                                       Text.class,             // mapper output key
                                       IntWritable.class,      // mapper output value
                                       job);

TableMapReduceUtil.initTableReducerJob(
                                        OutputTable,                // output table
                                        AnalyzeReducerTable.class,  // reducer class
                                        job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
an 8-node Cloudera cluster.

Should i use a custom SortComparator or a Group Comparator?


-- 
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Jens Scheidtmann <je...@gmail.com>.
Dear Pavan,

If it were working well, the runtime would be shorter. What makes you sure
this is HBase or Hadoop related? What percentage of the time is spent in
your algorithms?

Use System.currentTimeMillis() and run your program on the first 100,000
records single-threaded, printing to stdout. See where the time is spent.
Estimate the total runtime from that, and see if there is a speed-up.
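
A minimal sketch of that measurement, assuming a plain single-threaded
HBase scan over Table1 and a hypothetical processRecord() wrapping your
existing per-row logic (run inside a method that declares throws IOException):

long start = System.currentTimeMillis();
int count = 0;
ResultScanner scanner = table.getScanner(new Scan());   // 'table' is an HTable for Table1
for (Result row : scanner) {
    processRecord(row);              // hypothetical: your per-row map logic
    if (++count == 100000) break;
}
scanner.close();
long elapsedMs = System.currentTimeMillis() - start;
System.out.println("100,000 rows took " + elapsedMs + " ms");
// rough single-threaded extrapolation to the ~19M-row Table1:
System.out.println("estimated total: "
        + (elapsedMs / 100000.0 * 19000000 / 3600000.0) + " hours");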

Jens

On Tuesday, 24 September 2013, Pavan Sudheendra wrote:

> No, I'm pretty sure the job is executing fine.. Just that the time it
> takes to complete the whole process, is too much that's all..
>
>

RE: How to best decide mapper output/reducer input for a huge string?

Posted by Pavan Sudheendra <pa...@gmail.com>.
No, I'm pretty sure the job is executing fine.. Just that the time it takes
to complete the whole process is too much, that's all..

I didn't mean to say the mapper or the reducer doesn't work.. Just that
it's very slow and I'm trying to figure out where the time is going in my
code.

Regards,
Pavan
On Sep 23, 2013 11:49 PM, "John Lilley" <jo...@redpoint.net> wrote:

>  You might try creating a “stub” MR job in which the mappers produce no
> output; that would isolate the time spent reading from HBase without the
> trouble of instrumenting your code.
>
> John

RE: How to best decide mapper output/reducer input for a huge string?

Posted by John Lilley <jo...@redpoint.net>.
You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code.
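
For illustration, a minimal sketch of such a stub job, reusing the Scan,
Table1 and job objects from the original post (StubMapper is just a
placeholder name; imports come from org.apache.hadoop.io,
org.apache.hadoop.hbase.io/client/mapreduce and
org.apache.hadoop.mapreduce.lib.output):

// Stub mapper: scans every row but emits nothing, so the job's runtime is
// essentially the cost of reading from HBase.
public static class StubMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) {
        // intentionally empty: no output, no computation
    }
}

// driver side:
TableMapReduceUtil.initTableMapperJob(
        Table1, scan, StubMapper.class,
        NullWritable.class, NullWritable.class, job);
job.setOutputFormatClass(NullOutputFormat.class);
job.setNumReduceTasks(0);   // map-only
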
John


From: Pavan Sudheendra [mailto:pavan0591@gmail.com]
Sent: Monday, September 23, 2013 3:31 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

@John, to be really frank i don't know what the limiting factor is.. It might be all of them or a subset of them.. Cannot tell..

On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra <pa...@gmail.com> wrote:
@Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers are functional at the same time.. Although, @Pradeep, i should do the compression like you say.. I'll give it a shot.. As far as i can see, i think i'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string which might slow the operation down..

On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <pr...@gmail.com> wrote:
Pavan,

It's hard to tell whether there's anything wrong with your design or not since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only mapper might be doing the heavy lifting.

A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config.
<property>
    <name> mapreduce.map.output.compress </name>
    <value> true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codes such as Lzo, Snappy, Bzip2, etc. depending on your use cases. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good of a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster.
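
If you would rather set those two properties in the driver than in an XML
file, a sketch (property names as in the snippet above; older Hadoop 1.x
releases use the mapred.* equivalents, and Snappy is only an example here
and needs the native libraries installed):

Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);   // org.apache.hadoop.io.compress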

One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and see if that fits your deployment. http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep

On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:
One mapper is spawned per hbase table region. You can use the admin ui of hbase to find the number of regions per table. It might happen that all the data is sitting in a single region , so a single mapper is spawned and you are not getting enough parallel work getting done.
If that is the case then you can recreate the tables with predefined splits to create more regions.
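
A sketch of recreating a table with pre-defined splits through the Java
admin API; the table name, column family and split keys below are
placeholders that would have to match your HouseHoldId distribution, and
the old table would have to be disabled and dropped (or a new name used)
first. Assumes a method that declares throws IOException:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("Table1");   // placeholder table name
desc.addFamily(new HColumnDescriptor("cf"));               // placeholder column family
byte[][] splits = {                                         // example split points only
    Bytes.toBytes("2000000"),
    Bytes.toBytes("4000000"),
    Bytes.toBytes("6000000")
};
admin.createTable(desc, splits);                            // one region per split range
admin.close();
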
Thanks,
Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <jo...@redpoint.net> wrote:
Pavan,
How large are the rows in HBase?  22 million rows is not very much but you mentioned "huge strings".  Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?
John


From: Pavan Sudheendra [mailto:pavan0591@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the Tables in HBase are compressed.
Although, there's no real bottleneck, the time it takes to process the entire table is huge. I have to constantly check if i can optimize it somehow..
Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see any thing wrong with my design? Does it require any kind of re-work? Thank you so much for helping..

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pr...@gmail.com> wrote:
One thing that comes to mind is that your keys are Strings which are highly inefficient. You might get a lot better performance if you write a custom writable for your Key object using the appropriate data types. For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your speed of comparisons for sorting will also go up.
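
A sketch of what such a key might look like, assuming HouseHoldId fits in
an int and the timestamp in a long; the remaining fields from the mapper
output would be added the same way:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class HouseholdKey implements WritableComparable<HouseholdKey> {
    private int houseHoldId;
    private long timestamp;

    public void set(int houseHoldId, long timestamp) {
        this.houseHoldId = houseHoldId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(houseHoldId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        houseHoldId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseholdKey o) {
        // primary sort on HouseHoldId, then on timestamp
        if (houseHoldId != o.houseHoldId) {
            return houseHoldId < o.houseHoldId ? -1 : 1;
        }
        return timestamp < o.timestamp ? -1 : (timestamp == o.timestamp ? 0 : 1);
    }
}

You would also override hashCode()/equals() so the default HashPartitioner
spreads keys sensibly, and register the class as the mapper output key type.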

Ensure that your map output's are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com> wrote:
Hi Pradeep,
Yes.. Basically i'm only writing the key part as the map output.. The V of <K,V> is not of much use to me.. But i'm hoping to change that if it leads to faster execution.. I'm kind of a newbie so looking to make the map/reduce job run a lot faster..
Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But seems if i write a map output for each and every row of a 19 m row HBase table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig/Hive to do the job but i'm supposed to do this via a MR job.. So, cannot use either of that.. Do you recommend me to try something if i have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com> wrote:
I'm sorry but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this if you have a need for a non lexical sort order. The GroupingComparator will tell Hadoop how to group your data for the reducer. All KV-pairs from the same group will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, then you will need to write a GroupingComparator.
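
A sketch against the current Text keys, assuming HouseHoldId is the first
space-separated token of the key (with a custom WritableComparable key you
would compare its houseHoldId field directly instead):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class HouseholdGroupingComparator extends WritableComparator {
    protected HouseholdGroupingComparator() {
        super(Text.class, true);    // create Text instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group purely on the HouseHoldId prefix of the key
        String idA = a.toString().split(" ", 2)[0];
        String idB = b.toString().split(" ", 2)[0];
        return idA.compareTo(idB);
    }
}

// driver side:
job.setGroupingComparatorClass(HouseholdGroupingComparator.class);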

Also, have you considered using a higher level abstraction on Hadoop such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT easier to write in these languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com> wrote:

I need to improve my MR jobs which uses HBase as source as well as sink..

Basically, i'm reading data from 3 HBase Tables in the mapper, writing them out as one huge string for the reducer to do some computation and dump into a HBase Table..

Table1 ~ 19 million rows.

Table2 ~ 2 million rows.

Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting it on the basis of the HouseHoldID value so i'm using this technique. I'm not interested in the V part of pair so i'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

For my MR job to be completed, it takes 22 hours to complete which is not desirable at all. I'm supposed to optimize this somehow to run a lot faster somehow..

scan.setCaching(750);

scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob (

                                       Table1,           // input HBase table name

                                       scan,

                                       AnalyzeMapper.class,    // mapper

                                       Text.class,             // mapper output key

                                       IntWritable.class,      // mapper output value

                                       job);



                TableMapReduceUtil.initTableReducerJob(

                                        OutputTable,                // output table

                                        AnalyzeReducerTable.class,  // reducer class

                                        job);

                job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a 8 node cloudera cluster.

Should i use a custom SortComparator or a Group Comparator?


--
Regards-
Pavan




RE: How to best decide mapper output/reducer input for a huge string?

Posted by John Lilley <jo...@redpoint.net>.
You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code.
John
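A minimal sketch of such a stub (illustrative only; it reuses the class and variable names from the original post but is not code from this thread):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Nested inside the job driver class, like AnalyzeMapper in the original post.
// Scans Table1 but emits nothing, so the job's runtime approximates pure HBase read time.
public static class NoOpMapper extends TableMapper<Text, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Intentionally empty: read each row and discard it.
    }
}

// Wiring, analogous to the original job setup:
// TableMapReduceUtil.initTableMapperJob(Table1, scan, NoOpMapper.class,
//         Text.class, IntWritable.class, job);
// job.setNumReduceTasks(0);                                                // map-only
// job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.NullOutputFormat.class);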


From: Pavan Sudheendra [mailto:pavan0591@gmail.com]
Sent: Monday, September 23, 2013 3:31 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

@John, to be really frank i don't know what the limiting factor is.. It might be all of them or a subset of them.. Cannot tell..

On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra <pa...@gmail.com>> wrote:
@Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers are functional at the same time.. Although, @Pradeep, i should do the compression like you say.. I'll give it a shot.. As far as i can see, i think i'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string which might slow the operation down..

On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <pr...@gmail.com>> wrote:
Pavan,

It's hard to tell whether there's anything wrong with your design or not since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting.
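One lightweight way to do that instrumentation (a sketch, not from this thread; table2 and rowKey are hypothetical names, and the Get/Result/HTable classes come from org.apache.hadoop.hbase.client) is to accumulate timings in Hadoop counters and read the totals from the job UI or job history afterwards:

// Declared next to the mapper:
public enum MapTiming { HBASE_LOOKUP_MS, ROWS }

// Inside AnalyzeMapper.map():
long start = System.currentTimeMillis();
Result lookup = table2.get(new Get(rowKey));   // the per-row HBase lookup being timed
context.getCounter(MapTiming.HBASE_LOOKUP_MS)
       .increment(System.currentTimeMillis() - start);
context.getCounter(MapTiming.ROWS).increment(1);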

A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config.
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc. depending on your use cases. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster.
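The same settings can also be applied programmatically; a sketch (AnalyzeCombiner is a hypothetical reducer-style class that pre-aggregates each household's records on the map side):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);   // or GzipCodec.class

// Combiner: runs reduce-style aggregation on each mapper's local output
// before the shuffle, shrinking what crosses the network.
job.setCombinerClass(AnalyzeCombiner.class);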

One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and see if that fits your deployment. http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep

On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <ra...@gmail.com>> wrote:
One mapper is spawned per HBase table region. You can use the HBase admin UI to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done.
If that is the case, then you can recreate the tables with predefined splits to create more regions, as in the sketch below.
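A sketch of that pre-splitting, using the classic HBaseAdmin API of that era (table name, column family and split points are made up; real split keys should be drawn from the actual HouseHoldId distribution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("Table1_presplit");
        desc.addFamily(new HColumnDescriptor("d"));

        // 20 split points => 21 regions, spread evenly over a zero-padded numeric row key.
        byte[][] splits = new byte[20][];
        for (int i = 0; i < 20; i++) {
            splits[i] = Bytes.toBytes(String.format("%010d", (i + 1) * 1000000L));
        }
        admin.createTable(desc, splits);
        admin.close();
    }
}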
Thanks,
Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <jo...@redpoint.net>> wrote:
Pavan,
How large are the rows in HBase?  22 million rows is not very much but you mentioned "huge strings".  Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?
John


From: Pavan Sudheendra [mailto:pavan0591@gmail.com<ma...@gmail.com>]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.
Although there's no real bottleneck, the time it takes to process the entire table is huge. I have to constantly check if I can optimize it somehow..
Oh okay.. I'll implement a custom Writable.. Apart from that, do you see anything wrong with my design? Does it require any kind of re-work? Thank you so much for helping..

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pr...@gmail.com>> wrote:
One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your Key object using the appropriate data types. For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your speed of comparisons for sorting will also go up.
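As an illustration (a sketch, not code from this thread; the class is trimmed to two of the fields and the name is made up), such a key could look like:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class ViewingKey implements WritableComparable<ViewingKey> {
    private int houseHoldId;
    private long timestamp;

    public void set(int houseHoldId, long timestamp) {
        this.houseHoldId = houseHoldId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(houseHoldId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        houseHoldId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(ViewingKey other) {
        // Sort by HouseHoldId first, then by timestamp: binary ints/longs,
        // no string parsing or lexical comparison involved.
        if (houseHoldId != other.houseHoldId) {
            return houseHoldId < other.houseHoldId ? -1 : 1;
        }
        return timestamp < other.timestamp ? -1 : (timestamp == other.timestamp ? 0 : 1);
    }
}

To keep every record of a household in one reduce() call you would additionally partition and group on houseHoldId alone (a custom Partitioner plus the GroupingComparator discussed earlier in the thread).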

Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com>> wrote:
Hi Pradeep,
Yes.. Basically i'm only writing the key part as the map output.. The V of <K,V> is not of much use to me.. But i'm hoping to change that if it leads to faster execution.. I'm kind of a newbie so looking to make the map/reduce job run a lot faster..
Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But seems if i write a map output for each and every row of a 19 m row HBase table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig/Hive to do the job but i'm supposed to do this via a MR job.. So, cannot use either of that.. Do you recommend me to try something if i have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com>> wrote:
I'm sorry but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this if you have a need for a non lexical sort order. The GroupingComparator will tell Hadoop how to group your data for the reducer. All KV-pairs from the same group will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, then you will need to write a GroupingComparator.

Also, have you considered using a higher level abstraction on Hadoop such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT easier to write in these languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>> wrote:

I need to improve my MR jobs which uses HBase as source as well as sink..

Basically, i'm reading data from 3 HBase Tables in the mapper, writing them out as one huge string for the reducer to do some computation and dump into a HBase Table..

Table1 ~ 19 million rows.

Table2 ~ 2 million rows.

Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting it on the basis of the HouseHoldID value so i'm using this technique. I'm not interested in the V part of pair so i'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

For my MR job to be completed, it takes 22 hours to complete which is not desirable at all. I'm supposed to optimize this somehow to run a lot faster somehow..

scan.setCaching(750);

scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob (

                                       Table1,           // input HBase table name

                                       scan,

                                       AnalyzeMapper.class,    // mapper

                                       Text.class,             // mapper output key

                                       IntWritable.class,      // mapper output value

                                       job);



                TableMapReduceUtil.initTableReducerJob(

                                        OutputTable,                // output table

                                        AnalyzeReducerTable.class,  // reducer class

                                        job);

                job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a 8 node cloudera cluster.

Should i use a custom SortComparator or a Group Comparator?


--
Regards-
Pavan




Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pavan Sudheendra <pa...@gmail.com>.
@John, to be really frank, I don't know what the limiting factor is. It
might be all of them or a subset of them; I cannot tell.





-- 
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pavan Sudheendra <pa...@gmail.com>.
@Rahul, yes, you are right: 21 mappers are spawned, and all 21 of them are
running at the same time. @Pradeep, I should enable the compression like you
say; I'll give it a shot. As far as I can see, I'll also need to implement a
custom Writable and write out the mapper key using the proper data types
instead of writing it out as one big string, which might be slowing the
operation down.
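
Just to make sure I'm on the right track, this is roughly the kind of key
class I have in mind -- only HouseHoldId and the timestamp are shown, the
field types and class name are my own assumption, and it's untested:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: HouseHoldId first, so records still sort by household.
public class HouseholdKey implements WritableComparable<HouseholdKey> {

    private int houseHoldId;  // assuming HouseHoldId fits in an int
    private long timestamp;   // epoch millis instead of a formatted string

    public HouseholdKey() { } // no-arg constructor required by Hadoop

    public HouseholdKey(int houseHoldId, long timestamp) {
        this.houseHoldId = houseHoldId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(houseHoldId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        houseHoldId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseholdKey other) {
        // Primary sort on HouseHoldId, secondary sort on timestamp.
        int cmp = Integer.compare(houseHoldId, other.houseHoldId);
        return (cmp != 0) ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {
        // Used by the default HashPartitioner to pick a reducer.
        return houseHoldId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HouseholdKey)) return false;
        HouseholdKey k = (HouseholdKey) o;
        return houseHoldId == k.houseHoldId && timestamp == k.timestamp;
    }
}

The mapper would then extend TableMapper<HouseholdKey, IntWritable> and the
job would set HouseholdKey.class as the map output key class.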




-- 
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pradeep Gollakota <pr...@gmail.com>.
Pavan,

It's hard to tell whether there's anything wrong with your design or not
since you haven't given us specific enough details. The best thing you can
do is instrument your code and see what is taking a long time. Rahul
mentioned a problem that I myself have seen before, with only one region
(or a couple) having any data. So even if you have 21 regions, only one
mapper might be doing the heavy lifting.
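
To make the "instrument your code" suggestion concrete, here is a rough,
untested sketch of the kind of thing I mean inside your map() method; the
counter group/names and the lookup placeholder are assumptions about what
your mapper actually does:

// Inside AnalyzeMapper (a TableMapper), time the expensive steps and
// publish them as job counters so they show up in the job UI afterwards.
@Override
protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
    long t0 = System.currentTimeMillis();
    // ... the extra lookups against Table2 / Table3 go here ...
    context.getCounter("PROFILE", "lookupMillis")
           .increment(System.currentTimeMillis() - t0);

    long t1 = System.currentTimeMillis();
    // ... build the output key and call context.write(...) here ...
    context.getCounter("PROFILE", "emitMillis")
           .increment(System.currentTimeMillis() - t1);
}

Comparing those two counters after a run should tell you where the 22 hours
are actually going.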

A combiner is hugely helpful in terms of reducing the data output of
mappers. Writing a combiner is a best practice and you should almost always
have one. Compression can be turned on by setting the following properties
in your job config.
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
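
If you prefer to set this from the driver instead of the XML, the equivalent
is roughly the following (these are the Hadoop 2 / mapreduce.* names; older
MR1 setups use mapred.compress.map.output and
mapred.map.output.compression.codec instead):

// Same settings applied in the job driver code, where `job` is your Job.
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
              org.apache.hadoop.io.compress.GzipCodec.class,
              org.apache.hadoop.io.compress.CompressionCodec.class);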
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
depending on your use cases. Gzip is really slow but gets the best
compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
compression ratio. If your computations are CPU bound, then you'd probably
want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
are idle, you can use Gzip. You'll have to experiment and find the best
settings for you. There are a lot of other tweaks that you can try to get
the best performance out of your cluster.
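
Coming back to the combiner point above: since you said you don't really use
the value, even a trivial combiner that collapses the per-row records for a
key into a single count would shrink the shuffle a lot. A rough sketch with
your current Text/IntWritable types (rename as you like, and only do this if
your reducer is happy to receive summed counts):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: runs on the map side and sums the IntWritable values per key,
// so each mapper ships one record per key instead of one per scanned row.
public class AnalyzeCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}

It gets wired in with job.setCombinerClass(AnalyzeCombiner.class); note that
a combiner's input and output types must match the map output types, which is
why it stays Text/IntWritable here.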

One of the best things you can do is to install Ganglia (or some other
similar tool) on your cluster and monitor usage of resources while your job
is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and
see if that fits your deployment.
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then
there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep


On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> One mapper is spawned per hbase table region. You can use the admin ui of
> hbase to find the number of regions per table. It might happen that all the
> data is sitting in a single region , so a single mapper is spawned and you
> are not getting enough parallel work getting done.
>
> If that is the case then you can recreate the tables with predefined
> splits to create more regions.
>
> Thanks,
> Rahul
>
>
> On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <jo...@redpoint.net>wrote:
>
>> Pavan,
>>
>> How large are the rows in HBase?  22 million rows is not very much but
>> you mentioned “huge strings”.  Can you tell which part of the processing is
>> the limiting factor (read from HBase, mapper output, reducers)?
>>
>> John
>>
>> *From:* Pavan Sudheendra [mailto:pavan0591@gmail.com]
>> *Sent:* Saturday, September 21, 2013 2:17 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: How to best decide mapper output/reducer input for a huge
>> string?
>>
>> No, I don't have a combiner in place. Is it necessary? How do I make my
>> map output compressed? Yes, the Tables in HBase are compressed.
>>
>> Although, there's no real bottleneck, the time it takes to process the
>> entire table is huge. I have to constantly check if i can optimize it
>> somehow..
>>
>> Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see
>> any thing wrong with my design? Does it require any kind of re-work? Thank
>> you so much for helping..
>>
>> On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pr...@gmail.com>
>> wrote:
>>
>> One thing that comes to mind is that your keys are Strings which are
>> highly inefficient. You might get a lot better performance if you write a
>> custom writable for your Key object using the appropriate data types. For
>> example, use a long (LongWritable) for timestamps. This should make
>> (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
>> comparisons for sorting will also go up.
>>
>> Ensure that your map output's are being compressed. Are your tables in
>> HBase compressed? Do you have a combiner?
>>
>> Have you been able to profile your code to see where the bottlenecks are?
>>
>> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com>
>> wrote:
>>
>> Hi Pradeep,
>>
>> Yes.. Basically i'm only writing the key part as the map output.. The V
>> of <K,V> is not of much use to me.. But i'm hoping to change that if it
>> leads to faster execution.. I'm kind of a newbie so looking to make the
>> map/reduce job run a lot faster..
>>
>> Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
>> seems if i write a map output for each and every row of a 19 m row HBase
>> table, its taking nearly a day to complete.. (21 mappers and 21 reducers)
>>
>> I have looked at both Pig/Hive to do the job but i'm supposed to do this
>> via a MR job.. So, cannot use either of that.. Do you recommend me to try
>> something if i have the data in that format?
>>
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com>
>> wrote:
>>
>> I'm sorry but I don't understand your question. Is the output of the
>> mapper you're describing the key portion? If it is the key, then your data
>> should already be sorted by HouseHoldId since it occurs first in your key.
>>
>> The SortComparator will tell Hadoop how to sort your data. So you use
>> this if you have a need for a non lexical sort order. The
>> GroupingComparator will tell Hadoop how to group your data for the reducer.
>> All KV-pairs from the same group will be given to the same Reducer.
>>
>> If your reduce computation needs all the KV-pairs for the same
>> HouseHoldId, then you will need to write a GroupingComparator.
>>
>> Also, have you considered using a higher level abstraction on Hadoop such
>> as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
>> easier to write in these languages.
>>
>> Hope this helps!
>>
>> - Pradeep
>>
>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>
>> wrote:****
>>
>> I need to improve my MR jobs which uses HBase as source as well as sink..
>> ** **
>>
>> Basically, i'm reading data from 3 HBase Tables in the mapper, writing
>> them out as one huge string for the reducer to do some computation and dump
>> into a HBase Table.. ****
>>
>> Table1 ~ 19 million rows.****
>>
>> Table2 ~ 2 million rows.****
>>
>> Table3 ~ 900,000 rows.****
>>
>> The output of the mapper is something like this : ****
>>
>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp****
>>
>> I'm interested in sorting it on the basis of the HouseHoldID value so i'm
>> using this technique. I'm not interested in the V part of pair so i'm kind
>> of ignoring it. My mapper class is defined as follows:****
>>
>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }****
>>
>> For my MR job to be completed, it takes 22 hours to complete which is not
>> desirable at all. I'm supposed to optimize this somehow to run a lot faster
>> somehow..****
>>
>> scan.setCaching(750);        ****
>>
>> scan.setCacheBlocks(false); ****
>>
>> TableMapReduceUtil.initTableMapperJob (****
>>
>>                                        Table1,           // input HBase table name****
>>
>>                                        scan,                   ****
>>
>>                                        AnalyzeMapper.class,    // mapper****
>>
>>                                        Text.class,             // mapper output key****
>>
>>                                        IntWritable.class,      // mapper output value****
>>
>>                                        job);****
>>
>> ** **
>>
>>                 TableMapReduceUtil.initTableReducerJob(****
>>
>>                                         OutputTable,                // output table****
>>
>>                                         AnalyzeReducerTable.class,  // reducer class****
>>
>>                                         job);****
>>
>>                 job.setNumReduceTasks(RegionCount);  ****
>>
>> My HBase Table1 has 21 regions so 21 mappers are spawned. We are running
>> a 8 node cloudera cluster.****
>>
>> Should i use a custom SortComparator or a Group Comparator? ****
>>
>>
>> ****
>>
>>
>> --
>> Regards-****
>>
>> Pavan****
>>
>> ** **
>>
>>
>>
>> ****
>>
>> --
>> Regards-****
>>
>> Pavan****
>>
>> ** **
>>
>>
>>
>>
>> --
>> Regards-****
>>
>> Pavan****
>>
>
>

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pradeep Gollakota <pr...@gmail.com>.
Pavan,

It's hard to tell whether there's anything wrong with your design or not
since you haven't given us specific enough details. The best thing you can
do is instrument your code and see what is taking a long time. Rahul
mentioned a problem that I myself have seen before, with only one region
(or a couple) having any data. So even if you have 21 regions, only one mapper
might be doing the heavy lifting.
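
A minimal sketch of that kind of instrumentation, using Hadoop counters inside
the mapper (the counter names and the placeholder steps are illustrative, not
taken from the actual job):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class InstrumentedAnalyzeMapper extends TableMapper<Text, IntWritable> {

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();

        // ... lookups against the other two HBase tables would go here ...
        long afterLookups = System.currentTimeMillis();

        // ... build the output key and call context.write(...) here ...
        long afterEmit = System.currentTimeMillis();

        // Counters are aggregated across all tasks and shown in the job UI,
        // so you can see where the time goes without attaching a profiler.
        context.getCounter("Timing", "lookupMillis").increment(afterLookups - start);
        context.getCounter("Timing", "emitMillis").increment(afterEmit - afterLookups);
        context.getCounter("Timing", "rowsProcessed").increment(1);
    }
}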

A combiner is hugely helpful in terms of reducing the data output of
mappers. Writing a combiner is a best practice and you should almost always
have one. Compression can be turned on by setting the following properties
in your job config.
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.,
depending on your use case. Gzip is really slow but gets the best
compression ratios. Snappy/Lzo are a lot faster but don't have as good a
compression ratio. If your computations are CPU bound, then you'd probably
want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
are idle, you can use Gzip. You'll have to experiment and find the best
settings for you. There are a lot of other tweaks that you can try to get
the best performance out of your cluster.
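
If it is more convenient, the same settings (and the combiner) can be wired up
from the job driver instead of the XML config. A sketch only; the combiner
shown simply sums the IntWritable values, which assumes that is meaningful for
this particular job:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class JobSetupSketch {

    // Map-side combiner: collapses the values for each key before the shuffle.
    public static class AnalyzeCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            sum.set(total);
            context.write(key, sum);
        }
    }

    public static void configure(Job job) {
        Configuration conf = job.getConfiguration();
        // Compress intermediate map output to cut down shuffle I/O.
        // SnappyCodec needs the native Snappy libraries on the cluster.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // The combiner must consume the mapper's output types and emit the
        // reducer's input types.
        job.setCombinerClass(AnalyzeCombiner.class);
    }
}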

One of the best things you can do is to install Ganglia (or some other
similar tool) on your cluster and monitor usage of resources while your job
is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and
see if that fits your deployment.
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then
there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep


On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> One mapper is spawned per hbase table region. You can use the admin ui of
> hbase to find the number of regions per table. It might happen that all the
> data is sitting in a single region , so a single mapper is spawned and you
> are not getting enough parallel work getting done.
>
> If that is the case then you can recreate the tables with predefined
> splits to create more regions.
>
> Thanks,
> Rahul
>

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
One mapper is spawned per hbase table region. You can use the admin ui of
hbase to find the number of regions per table. It might happen that all the
data is sitting in a single region, so a single mapper is spawned and you
are not getting enough parallel work done.

If that is the case then you can recreate the tables with predefined splits
to create more regions.
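
For example, something along these lines (a sketch only; the table name,
column family and split points are made up and would have to match the real
row-key distribution, and the API shown is the HBaseAdmin one from the HBase
versions current at the time):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("Table1_presplit");
        desc.addFamily(new HColumnDescriptor("d"));

        // One extra region per split boundary. Pick boundaries that spread the
        // real row keys evenly, otherwise the extra regions will not help.
        byte[][] splits = new byte[][] {
                Bytes.toBytes("2000000"),
                Bytes.toBytes("4000000"),
                Bytes.toBytes("6000000"),
                Bytes.toBytes("8000000"),
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}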

Thanks,
Rahul


On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <jo...@redpoint.net> wrote:

>  Pavan,
>
> How large are the rows in HBase?  22 million rows is not very much but
> you mentioned “huge strings”.  Can you tell which part of the processing is
> the limiting factor (read from HBase, mapper output, reducers)?
>
> John

RE: How to best decide mapper output/reducer input for a huge string?

Posted by John Lilley <jo...@redpoint.net>.
Pavan,
How large are the rows in HBase?  22 million rows is not very much but you mentioned "huge strings".  Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?
John


From: Pavan Sudheendra [mailto:pavan0591@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the Tables in HBase are compressed.
Although, there's no real bottleneck, the time it takes to process the entire table is huge. I have to constantly check if i can optimize it somehow..
Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see any thing wrong with my design? Does it require any kind of re-work? Thank you so much for helping..

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pr...@gmail.com>> wrote:
One thing that comes to mind is that your keys are Strings which are highly inefficient. You might get a lot better performance if you write a custom writable for your Key object using the appropriate data types. For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your speed of comparisons for sorting will also go up.

Ensure that your map output's are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com>> wrote:
Hi Pradeep,
Yes.. Basically i'm only writing the key part as the map output.. The V of <K,V> is not of much use to me.. But i'm hoping to change that if it leads to faster execution.. I'm kind of a newbie so looking to make the map/reduce job run a lot faster..
Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But seems if i write a map output for each and every row of a 19 m row HBase table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig/Hive to do the job but i'm supposed to do this via a MR job.. So, cannot use either of that.. Do you recommend me to try something if i have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com>> wrote:
I'm sorry but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this if you have a need for a non lexical sort order. The GroupingComparator will tell Hadoop how to group your data for the reducer. All KV-pairs from the same group will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, then you will need to write a GroupingComparator.
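
For illustration, a minimal sketch of such a grouping comparator (the class name is invented here, and it assumes the mapper key stays a space-delimited Text whose first token is the HouseHoldId):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Nested in the driver class alongside AnalyzeMapper.
// Groups map outputs by the first token of the Text key (the HouseHoldId),
// so a single reduce() call sees every record for one household.
public static class HouseholdGroupingComparator extends WritableComparator {
    protected HouseholdGroupingComparator() {
        super(Text.class, true);   // true = instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split(" ", 2)[0];
        String idB = b.toString().split(" ", 2)[0];
        return idA.compareTo(idB);
    }
}

// registered in the driver before submitting the job:
// job.setGroupingComparatorClass(HouseholdGroupingComparator.class);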

Also, have you considered using a higher level abstraction on Hadoop such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT easier to write in these languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>> wrote:

I need to improve my MR jobs which uses HBase as source as well as sink..

Basically, i'm reading data from 3 HBase Tables in the mapper, writing them out as one huge string for the reducer to do some computation and dump into a HBase Table..

Table1 ~ 19 million rows.

Table2 ~ 2 million rows.

Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting it on the basis of the HouseHoldID value so i'm using this technique. I'm not interested in the V part of pair so i'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

For my MR job to be completed, it takes 22 hours to complete which is not desirable at all. I'm supposed to optimize this somehow to run a lot faster somehow..

scan.setCaching(750);
scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob(
                                       Table1,                 // input HBase table name
                                       scan,
                                       AnalyzeMapper.class,    // mapper
                                       Text.class,             // mapper output key
                                       IntWritable.class,      // mapper output value
                                       job);

                TableMapReduceUtil.initTableReducerJob(
                                        OutputTable,                // output table
                                        AnalyzeReducerTable.class,  // reducer class
                                        job);
                job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a 8 node cloudera cluster.

Should i use a custom SortComparator or a Group Comparator?


--
Regards-
Pavan



--
Regards-
Pavan




--
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pavan Sudheendra <pa...@gmail.com>.
No, I don't have a combiner in place. Is it necessary? How do I make my map
output compressed? Yes, the Tables in HBase are compressed.
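
For what it's worth, a minimal sketch of how map output compression is usually switched on in the driver (assuming Hadoop 2.x property names and that the Snappy codec is available on the cluster; older MR1 releases use mapred.compress.map.output / mapred.map.output.compression.codec instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

// job is the same Job object that gets passed to initTableMapperJob
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);   // compress the shuffled map output
conf.setClass("mapreduce.map.output.compress.codec",
              SnappyCodec.class, CompressionCodec.class);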

Although there's no single obvious bottleneck, the time it takes to process the
entire table is huge, so I have to keep checking whether I can optimize it
somehow..

Oh okay.. I'll implement a custom Writable.. Apart from that, do you see
anything wrong with my design? Does it require any kind of re-work? Thank
you so much for helping..


On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pr...@gmail.com>wrote:

> One thing that comes to mind is that your keys are Strings which are
> highly inefficient. You might get a lot better performance if you write a
> custom writable for your Key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
> comparisons for sorting will also go up.
>
> Ensure that your map output's are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
>
> Have you been able to profile your code to see where the bottlenecks are?
>
>
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com>wrote:
>
>> Hi Pradeep,
>> Yes.. Basically i'm only writing the key part as the map output.. The V
>> of <K,V> is not of much use to me.. But i'm hoping to change that if it
>> leads to faster execution.. I'm kind of a newbie so looking to make the
>> map/reduce job run a lot faster..
>>
>> Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
>> seems if i write a map output for each and every row of a 19 m row HBase
>> table, its taking nearly a day to complete.. (21 mappers and 21 reducers)
>>
>> I have looked at both Pig/Hive to do the job but i'm supposed to do this
>> via a MR job.. So, cannot use either of that.. Do you recommend me to try
>> something if i have the data in that format?
>>
>>
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pradeepg26@gmail.com
>> > wrote:
>>
>>> I'm sorry but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your data
>>> should already be sorted by HouseHoldId since it occurs first in your key.
>>>
>>> The SortComparator will tell Hadoop how to sort your data. So you use
>>> this if you have a need for a non lexical sort order. The
>>> GroupingComparator will tell Hadoop how to group your data for the reducer.
>>> All KV-pairs from the same group will be given to the same Reducer.
>>>
>>> If your reduce computation needs all the KV-pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>>
>>> Also, have you considered using a higher level abstraction on Hadoop
>>> such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are
>>> a LOT easier to write in these languages.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>wrote:
>>>
>>>> I need to improve my MR jobs which uses HBase as source as well as
>>>> sink..
>>>>
>>>> Basically, i'm reading data from 3 HBase Tables in the mapper, writing
>>>> them out as one huge string for the reducer to do some computation and dump
>>>> into a HBase Table..
>>>>
>>>> Table1 ~ 19 million rows.Table2 ~ 2 million rows.Table3 ~ 900,000 rows.
>>>>
>>>> The output of the mapper is something like this :
>>>>
>>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>>
>>>> I'm interested in sorting it on the basis of the HouseHoldID value so
>>>> i'm using this technique. I'm not interested in the V part of pair so i'm
>>>> kind of ignoring it. My mapper class is defined as follows:
>>>>
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>>>
>>>> For my MR job to be completed, it takes 22 hours to complete which is
>>>> not desirable at all. I'm supposed to optimize this somehow to run a lot
>>>> faster somehow..
>>>>
>>>> scan.setCaching(750);
>>>> scan.setCacheBlocks(false); TableMapReduceUtil.initTableMapperJob (
>>>>                                        Table1,           // input HBase table name
>>>>                                        scan,
>>>>                                        AnalyzeMapper.class,    // mapper
>>>>                                        Text.class,             // mapper output key
>>>>                                        IntWritable.class,      // mapper output value
>>>>                                        job);
>>>>
>>>>                 TableMapReduceUtil.initTableReducerJob(
>>>>                                         OutputTable,                // output table
>>>>                                         AnalyzeReducerTable.class,  // reducer class
>>>>                                         job);
>>>>                 job.setNumReduceTasks(RegionCount);
>>>>
>>>> My HBase Table1 has 21 regions so 21 mappers are spawned. We are
>>>> running a 8 node cloudera cluster.
>>>>
>>>> Should i use a custom SortComparator or a Group Comparator?
>>>>
>>>>
>>>> --
>>>> Regards-
>>>> Pavan
>>>>
>>>
>>>
>>
>>
>> --
>> Regards-
>> Pavan
>>
>
>


-- 
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pradeep Gollakota <pr...@gmail.com>.
One thing that comes to mind is that your keys are Strings, which are highly
inefficient. You might get a lot better performance if you write a custom
Writable for your Key object using the appropriate data types. For example,
use a long (LongWritable) for timestamps. This should make
(de)serialization a lot faster. If HouseHoldId is an integer, your speed of
comparisons for sorting will also go up.
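
Just as an illustrative sketch (not from the thread; the class and field
names are made up, and it assumes HouseHoldId fits in an int and the
timestamp in a long), a composite key along these lines is what I mean:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: binary fields instead of one big Text string.
public class HouseholdKey implements WritableComparable<HouseholdKey> {
    private int householdId;
    private long timestamp;

    public HouseholdKey() {}                       // no-arg constructor required by Hadoop

    public HouseholdKey(int householdId, long timestamp) {
        this.householdId = householdId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(householdId);                 // 4 bytes instead of a decimal string
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        householdId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseholdKey other) {     // numeric compare, cheaper than comparing Text
        int cmp = Integer.compare(householdId, other.householdId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {                        // used by the default HashPartitioner,
        return householdId;                        // so records of one household land together
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HouseholdKey)) return false;
        HouseholdKey k = (HouseholdKey) o;
        return householdId == k.householdId && timestamp == k.timestamp;
    }
}

You would then pass HouseholdKey.class instead of Text.class as the mapper
output key class in initTableMapperJob.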

Ensure that your map outputs are being compressed. Are your tables in
HBase compressed? Do you have a combiner?
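
For what it's worth, compressing the map output is only a couple of config
calls in the driver. A rough sketch (assuming `job` is the Job you already
configure; Snappy needs the native libraries on the cluster, so swap in
another codec if they are not installed):

// In the driver, next to the existing initTableMapperJob call.
org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
// MRv1 property names shown; on MRv2 they are mapreduce.map.output.compress
// and mapreduce.map.output.compress.codec.
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              org.apache.hadoop.io.compress.SnappyCodec.class,
              org.apache.hadoop.io.compress.CompressionCodec.class);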

Have you been able to profile your code to see where the bottlenecks are?


On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pa...@gmail.com>wrote:

> Hi Pradeep,
> Yes.. Basically i'm only writing the key part as the map output.. The V of
> <K,V> is not of much use to me.. But i'm hoping to change that if it leads
> to faster execution.. I'm kind of a newbie so looking to make the
> map/reduce job run a lot faster..
>
> Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
> seems if i write a map output for each and every row of a 19 m row HBase
> table, its taking nearly a day to complete.. (21 mappers and 21 reducers)
>
> I have looked at both Pig/Hive to do the job but i'm supposed to do this
> via a MR job.. So, cannot use either of that.. Do you recommend me to try
> something if i have the data in that format?
>
>
> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com>wrote:
>
>> I'm sorry but I don't understand your question. Is the output of the
>> mapper you're describing the key portion? If it is the key, then your data
>> should already be sorted by HouseHoldId since it occurs first in your key.
>>
>> The SortComparator will tell Hadoop how to sort your data. So you use
>> this if you have a need for a non lexical sort order. The
>> GroupingComparator will tell Hadoop how to group your data for the reducer.
>> All KV-pairs from the same group will be given to the same Reducer.
>>
>> If your reduce computation needs all the KV-pairs for the same
>> HouseHoldId, then you will need to write a GroupingComparator.
>>
>> Also, have you considered using a higher level abstraction on Hadoop such
>> as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
>> easier to write in these languages.
>>
>> Hope this helps!
>> - Pradeep
>>
>>
>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>wrote:
>>
>>> I need to improve my MR jobs which uses HBase as source as well as
>>> sink..
>>>
>>> Basically, i'm reading data from 3 HBase Tables in the mapper, writing
>>> them out as one huge string for the reducer to do some computation and dump
>>> into a HBase Table..
>>>
>>> Table1 ~ 19 million rows.Table2 ~ 2 million rows.Table3 ~ 900,000 rows.
>>>
>>> The output of the mapper is something like this :
>>>
>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>
>>> I'm interested in sorting it on the basis of the HouseHoldID value so
>>> i'm using this technique. I'm not interested in the V part of pair so i'm
>>> kind of ignoring it. My mapper class is defined as follows:
>>>
>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>>
>>> For my MR job to be completed, it takes 22 hours to complete which is
>>> not desirable at all. I'm supposed to optimize this somehow to run a lot
>>> faster somehow..
>>>
>>> scan.setCaching(750);
>>> scan.setCacheBlocks(false); TableMapReduceUtil.initTableMapperJob (
>>>                                        Table1,           // input HBase table name
>>>                                        scan,
>>>                                        AnalyzeMapper.class,    // mapper
>>>                                        Text.class,             // mapper output key
>>>                                        IntWritable.class,      // mapper output value
>>>                                        job);
>>>
>>>                 TableMapReduceUtil.initTableReducerJob(
>>>                                         OutputTable,                // output table
>>>                                         AnalyzeReducerTable.class,  // reducer class
>>>                                         job);
>>>                 job.setNumReduceTasks(RegionCount);
>>>
>>> My HBase Table1 has 21 regions so 21 mappers are spawned. We are running
>>> a 8 node cloudera cluster.
>>>
>>> Should i use a custom SortComparator or a Group Comparator?
>>>
>>>
>>> --
>>> Regards-
>>> Pavan
>>>
>>
>>
>
>
> --
> Regards-
> Pavan
>

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pavan Sudheendra <pa...@gmail.com>.
Hi Pradeep,
Yes.. Basically I'm only writing the key part as the map output.. The V of
<K,V> is not of much use to me.. But I'm hoping to change that if it leads
to faster execution.. I'm kind of a newbie, so I'm looking to make the
map/reduce job run a lot faster..

Also, yes. It gets sorted by the HouseHoldID, which is what I needed.. But
it seems that if I write a map output for each and every row of a 19 million
row HBase table, it takes nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig and Hive for this, but I'm supposed to do it via
an MR job.. So I cannot use either of those.. Would you recommend something
to try, given that I have the data in that format?


On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pr...@gmail.com>wrote:

> I'm sorry but I don't understand your question. Is the output of the
> mapper you're describing the key portion? If it is the key, then your data
> should already be sorted by HouseHoldId since it occurs first in your key.
>
> The SortComparator will tell Hadoop how to sort your data. So you use this
> if you have a need for a non lexical sort order. The GroupingComparator
> will tell Hadoop how to group your data for the reducer. All KV-pairs from
> the same group will be given to the same Reducer.
>
> If your reduce computation needs all the KV-pairs for the same
> HouseHoldId, then you will need to write a GroupingComparator.
>
> Also, have you considered using a higher level abstraction on Hadoop such
> as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
> easier to write in these languages.
>
> Hope this helps!
> - Pradeep
>
>
> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>wrote:
>
>> I need to improve my MR jobs which uses HBase as source as well as sink..
>>
>> Basically, i'm reading data from 3 HBase Tables in the mapper, writing
>> them out as one huge string for the reducer to do some computation and dump
>> into a HBase Table..
>>
>> Table1 ~ 19 million rows.Table2 ~ 2 million rows.Table3 ~ 900,000 rows.
>>
>> The output of the mapper is something like this :
>>
>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>
>> I'm interested in sorting it on the basis of the HouseHoldID value so i'm
>> using this technique. I'm not interested in the V part of pair so i'm kind
>> of ignoring it. My mapper class is defined as follows:
>>
>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>
>> For my MR job to be completed, it takes 22 hours to complete which is not
>> desirable at all. I'm supposed to optimize this somehow to run a lot faster
>> somehow..
>>
>> scan.setCaching(750);
>> scan.setCacheBlocks(false); TableMapReduceUtil.initTableMapperJob (
>>                                        Table1,           // input HBase table name
>>                                        scan,
>>                                        AnalyzeMapper.class,    // mapper
>>                                        Text.class,             // mapper output key
>>                                        IntWritable.class,      // mapper output value
>>                                        job);
>>
>>                 TableMapReduceUtil.initTableReducerJob(
>>                                         OutputTable,                // output table
>>                                         AnalyzeReducerTable.class,  // reducer class
>>                                         job);
>>                 job.setNumReduceTasks(RegionCount);
>>
>> My HBase Table1 has 21 regions so 21 mappers are spawned. We are running
>> a 8 node cloudera cluster.
>>
>> Should i use a custom SortComparator or a Group Comparator?
>>
>>
>> --
>> Regards-
>> Pavan
>>
>
>


-- 
Regards-
Pavan

Re: How to best decide mapper output/reducer input for a huge string?

Posted by Pradeep Gollakota <pr...@gmail.com>.
I'm sorry but I don't understand your question. Is the output of the mapper
you're describing the key portion? If it is the key, then your data should
already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it if you
need a non-lexical sort order. The GroupingComparator tells Hadoop how to
group your data for the reducer: all KV-pairs from the same group are handed
to the same reduce() call.

If your reduce computation needs all the KV-pairs for the same HouseHoldId,
then you will need to write a GroupingComparator.
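
A minimal sketch of what that could look like, assuming you keep the Text
key and the HouseHoldId is the first space-separated token (the class name
is made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical grouping comparator: compares only the HouseHoldId prefix of
// the Text key, so every record for one household reaches the same reduce() call.
public class HouseholdGroupingComparator extends WritableComparator {

    public HouseholdGroupingComparator() {
        super(Text.class, true);                   // true => hand compare() real Text instances
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split(" ", 2)[0];   // first field = HouseHoldId
        String idB = b.toString().split(" ", 2)[0];
        return idA.compareTo(idB);
    }
}

You would register it in the driver with
job.setGroupingComparatorClass(HouseholdGroupingComparator.class).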

Also, have you considered using a higher-level abstraction on Hadoop such
as Pig, Hive, Cascading, etc.? These sorting/grouping kinds of tasks are a LOT
easier to write in those languages.

Hope this helps!
- Pradeep


On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pa...@gmail.com>wrote:

> I need to improve my MR jobs which uses HBase as source as well as sink..
>
> Basically, i'm reading data from 3 HBase Tables in the mapper, writing
> them out as one huge string for the reducer to do some computation and dump
> into a HBase Table..
>
> Table1 ~ 19 million rows.Table2 ~ 2 million rows.Table3 ~ 900,000 rows.
>
> The output of the mapper is something like this :
>
> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>
> I'm interested in sorting it on the basis of the HouseHoldID value so i'm
> using this technique. I'm not interested in the V part of pair so i'm kind
> of ignoring it. My mapper class is defined as follows:
>
> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>
> For my MR job to be completed, it takes 22 hours to complete which is not
> desirable at all. I'm supposed to optimize this somehow to run a lot faster
> somehow..
>
> scan.setCaching(750);
> scan.setCacheBlocks(false); TableMapReduceUtil.initTableMapperJob (
>                                        Table1,           // input HBase table name
>                                        scan,
>                                        AnalyzeMapper.class,    // mapper
>                                        Text.class,             // mapper output key
>                                        IntWritable.class,      // mapper output value
>                                        job);
>
>                 TableMapReduceUtil.initTableReducerJob(
>                                         OutputTable,                // output table
>                                         AnalyzeReducerTable.class,  // reducer class
>                                         job);
>                 job.setNumReduceTasks(RegionCount);
>
> My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a
> 8 node cloudera cluster.
>
> Should i use a custom SortComparator or a Group Comparator?
>
>
> --
> Regards-
> Pavan
>
