Posted to dev@hbase.apache.org by "Billy Pearson (JIRA)" <ji...@apache.org> on 2008/11/10 02:56:44 UTC
[jira] Created: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
We need a Hbase Partitioner for TableMapReduceUtil.initTableReduceJob MR Jobs
-----------------------------------------------------------------------------
Key: HBASE-987
URL: https://issues.apache.org/jira/browse/HBASE-987
Project: Hadoop HBase
Issue Type: Improvement
Components: mapred
Reporter: Billy Pearson
Priority: Minor
Fix For: 0.20.0
When we run, say, 20 reducers, they all get ~1/20th of the data to output to the table.
The problem for us on large import jobs is that the data gets sorted by key, so all the
reducers pound one region at a time.
We need to add onto the TableMapReduceUtil.initTableReduceJob method so it can set the
partitioner and set the number of reducers equal to the number of regions, as the table
map does for maps.
Then the partitioner will send all the BatchUpdates for one region to one reducer,
giving a more even spread of writers across the regions. This would ensure that only one
reducer sends updates to any one region, keeping any one region from getting more
overloaded than the others.
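The routing described above can be sketched in plain Java. This is an illustrative stand-alone partitioner, not the actual HBase HRegionPartitioner from the patch; the region start keys, the byte-wise key comparison, and the fallback-to-modulo behavior when there are fewer reducers than regions are all assumptions for the sketch:

```java
// Sketch of a region-aware partitioner: route each row key to the reducer
// that owns the region containing that key, so each region is written by
// exactly one reducer. Region layout here is purely illustrative.
public class RegionPartitionerSketch {
    private final byte[][] startKeys; // sorted region start keys; first is empty

    public RegionPartitionerSketch(byte[][] startKeys) {
        this.startKeys = startKeys;
    }

    // Find the region whose [startKey, nextStartKey) range contains row.
    public int getPartition(byte[] row, int numReduceTasks) {
        int region = 0;
        for (int i = 0; i < startKeys.length; i++) {
            if (compare(row, startKeys[i]) >= 0) {
                region = i;
            }
        }
        // Assumption: if fewer reducers than regions, fall back to modulo.
        return numReduceTasks < startKeys.length ? region % numReduceTasks : region;
    }

    // Unsigned lexicographic byte comparison, like HBase row-key ordering.
    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[][] starts = { {}, "g".getBytes(), "n".getBytes(), "t".getBytes() };
        RegionPartitionerSketch p = new RegionPartitionerSketch(starts);
        System.out.println(p.getPartition("apple".getBytes(), 4)); // prints 0
        System.out.println(p.getPartition("melon".getBytes(), 4)); // prints 1
    }
}
```

With the reducer count set equal to the region count, every key in one region lands on the same reducer and no two reducers write to the same region.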
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by Michael Stack <st...@duboce.net>.
Billy Pearson (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646495#action_12646495 ]
>
> Billy Pearson commented on HBASE-987:
> -------------------------------------
>
> your method needs a
>
> throws IOException
> or a try catch
>
> currently breaks build
Thanks Billy. Fixed now.
St.Ack
[jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646495#action_12646495 ]
Billy Pearson commented on HBASE-987:
-------------------------------------
Your method needs a throws IOException or a try/catch; it currently breaks the build.
[jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646377#action_12646377 ]
stack commented on HBASE-987:
-----------------------------
Nice patch Billy.
Is it true to say that your partitioner makes sense when tables are 'mature', i.e. have many regions, so an upload will not generally disturb the total number of regions involved? Or to put it another way, if a table starts out with a low number of regions and we load a big dataset that makes the regions grow considerably in number, would the old default partitioner be a better fit, since the key-space would be divided by the count of regions assigned?
If so, should this new partitioner not be the default but a nice option users can select after a call to initTableReduceJob? Perhaps add a method that will configure this new partitioner into place along with your trick of keeping the number of reducers == the number of regions.
Also, HRegionPartitioner needs an Apache license header, and line lengths are generally < 80 characters in hadoop projects.
Otherwise, the patch is great.
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson updated HBASE-987:
--------------------------------
Fix Version/s: 0.19.0 (was: 0.20.0)
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson updated HBASE-987:
--------------------------------
Attachment: 987-3.patch.txt
OK, I think this is what you are talking about:
updated to a max of 80 characters per line
and added the Apache license to both files.
[jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646198#action_12646198 ]
Billy Pearson commented on HBASE-987:
-------------------------------------
Ran an import job importing 9,972,859 rows into 8 regions on 4 servers.
Without HRegionPartitioner:
Average time taken by reduce tasks: 5min, 18sec
With HRegionPartitioner:
Average time taken by reduce tasks: 3min, 11sec (about a 1.66x speedup: 318sec / 191sec)
This saves quite a bit of time on imports with just a small set of servers, and it spreads
out the average load on the regions so no one server is getting hit hard while others
sit idle.
Would still like to see some stats from someone that has a larger server group than mine.
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson updated HBASE-987:
--------------------------------
Attachment: 987-2.patch.txt
Updated the patch to cap the number of reducers at the number of regions when it is set
higher than the region count. The setting is adjusted in
TableMapReduceUtil.initTableReduceJob, but the user can still set the reducer count after
calling it and override the adjustment.
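A minimal sketch of that adjustment, assuming a simplified stand-in for the job configuration (the real patch works against Hadoop's JobConf; the names here are illustrative, not the actual signatures):

```java
// Sketch of the reducer-count cap: job setup never asks for more reducers
// than the table has regions, but a later explicit call still overrides it.
public class ReducerCapSketch {
    private int numReduceTasks;

    // Called during initTableReduceJob-style setup: cap reducers at regions.
    public void initTableReduceJob(int requestedReducers, int regionCount) {
        numReduceTasks = Math.min(requestedReducers, regionCount);
    }

    // User override after init wins, as described in the comment above.
    public void setNumReduceTasks(int n) {
        numReduceTasks = n;
    }

    public int getNumReduceTasks() {
        return numReduceTasks;
    }
}
```

So asking for 20 reducers against an 8-region table yields 8 reduce tasks unless the user explicitly sets the count again afterwards.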
[jira] Assigned: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson reassigned HBASE-987:
-----------------------------------
Assignee: Billy Pearson
[jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646413#action_12646413 ]
Billy Pearson commented on HBASE-987:
-------------------------------------
I was thinking about that also. What if we do this:
enum the partitioner choices (Hash, HRegion),
then call
TableMapReduceUtil.initTableReduceJob(Table, ReduceClass, Conf, Partitioner);
then add javadocs about the different partitioners so the end user can select as needed.
That way we can add more partitioners as needed.
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-987:
------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Committed. Thanks for the patch Billy. I added a note to the mapred package documentation that makes mention of the new partitioner.
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson updated HBASE-987:
--------------------------------
Status: Patch Available (was: Open)
[jira] Updated: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billy Pearson updated HBASE-987:
--------------------------------
Attachment: 987.patch.txt
The only thing I have not addressed in this patch is that if someone sets the number of
reducers higher than the number of regions a table has, the tasks beyond the region count
will have no work to do. Somewhere in the process we should reduce the reducer count to
the number of regions, like we do in TableMap, but I do not know where you guys would
like me to do that; maybe it can be done in TableMapReduceUtil.initTableReduceJob. Any
other ideas?
Need someone to review with a larger number of servers than I have.
[jira] Commented: (HBASE-987) We need a Hbase Partitioner for
TableMapReduceUtil.initTableReduceJob MR Jobs
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646415#action_12646415 ]
stack commented on HBASE-987:
-----------------------------
Rather than an enum, just have an overload of TMRU.initTableReduceJob that takes a Partitioner. Default is to use the old Partitioner?
If you do the above, I'll do the bit that adds documentation on this new facility?
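The overload idea could look roughly like this. A hedged sketch only: the class name, the string stand-ins for partitioner classes, and DEFAULT_PARTITIONER are all illustrative, not the real TableMapReduceUtil signatures:

```java
// Sketch of the overload approach: the existing entry point keeps the old
// default partitioner, and a new overload lets callers opt in to another
// one (e.g. a region-aware partitioner) without any enum.
public class InitOverloadSketch {
    static final String DEFAULT_PARTITIONER = "HashPartitioner"; // assumed default
    private String partitioner;

    // Existing signature: unchanged behavior, old default partitioner.
    public void initTableReduceJob(String table, String reducerClass) {
        initTableReduceJob(table, reducerClass, DEFAULT_PARTITIONER);
    }

    // New overload: caller supplies the partitioner explicitly.
    public void initTableReduceJob(String table, String reducerClass,
                                   String partitionerClass) {
        this.partitioner = partitionerClass;
    }

    public String getPartitioner() {
        return partitioner;
    }
}
```

Callers who do nothing get the old behavior; callers who want region-aware routing pass it in at the third argument, and new partitioners can be added later without touching the API again.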