Posted to common-dev@hadoop.apache.org by "stack (JIRA)" <ji...@apache.org> on 2007/10/18 20:33:51 UTC

[jira] Created: (HADOOP-2075) [hbase] Bulk load and dump tools

[hbase] Bulk load and dump tools
--------------------------------

                 Key: HADOOP-2075
                 URL: https://issues.apache.org/jira/browse/HADOOP-2075
             Project: Hadoop
          Issue Type: New Feature
          Components: contrib/hbase
            Reporter: stack
            Priority: Minor


HBase needs tools to facilitate bulk upload and, possibly, dumping.  Going through the current APIs, uploads can take a long time even with many concurrent clients, particularly when the dataset is large and cell content is small.

The PNUTS folks talked of the need for a different API to manage bulk upload/dump.

Another notion would be to have the bulk loader tools write regions directly into HDFS.
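The "write regions directly" notion could look roughly like the sketch below: sort the input by row key, then chunk it into region-sized sorted slices that could be laid out as region files in HDFS instead of pushing every cell through the client API. All names here are illustrative; this is not HBase code.

```python
def write_region_files(records, rows_per_region):
    """Sort (row_key, value) pairs and split them into per-region lists.

    Each returned "region" is a sorted, contiguous slice of the key space,
    mirroring how a bulk loader could lay out region files directly rather
    than inserting row by row through the API.
    """
    ordered = sorted(records, key=lambda kv: kv[0])
    return [
        ordered[i:i + rows_per_region]
        for i in range(0, len(ordered), rows_per_region)
    ]

regions = write_region_files(
    [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row4", "d")],
    rows_per_region=2,
)
# Two regions, each sorted and covering a contiguous key range.
```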



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549765 ] 

Bryan Duxbury commented on HADOOP-2075:
---------------------------------------

Actually, you wouldn't have to be too concerned with the distribution of splits early on, because even if some of the regions ended up abnormally small, they would eventually be merged with neighboring regions, no?



[jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540582 ] 

stack commented on HADOOP-2075:
-------------------------------

The bulk uploader needs to be able to tolerate myriad input data types.  The data will likely need massaging, and if we write HRegion content directly into HDFS rather than going through the HBase API -- preferred, since bulk uploads through the API will be dog slow -- it also has to be sorted.  Using MapReduce would make sense.

Look too at using Pig, because it has a few LOAD implementations -- reading from files on local disk or HDFS -- and some facility for transforming data as tuples move through.  We would need to write a special STORE operator that wrote the sorted data out as HRegions directly into HDFS (this would be different from PIG-6, which is about writing into HBase via the API).

Also, chatting with Jim, this is a pretty important issue: it is the first one folks run into when they start to get serious about HBase.
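The sort step a MapReduce bulk loader would need can be sketched as range partitioning: pick split points from a sample of the keys, then route each record to the partition owning its key range, so every partition can be written out as one sorted region. This is an illustrative sketch, not Hadoop or Pig code; the function names are made up.

```python
import bisect

def pick_split_points(sample_keys, num_partitions):
    """Choose num_partitions - 1 boundary keys from a sorted key sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) // num_partitions
    return [ordered[i * step] for i in range(1, num_partitions)]

def partition_for(key, split_points):
    """Index of the partition whose contiguous key range contains `key`."""
    return bisect.bisect_right(split_points, key)
```

With split points drawn from a representative sample, each partition receives a contiguous slice of the key space, and sorting within a partition yields data ready to be written as a region.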



[jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by "Chad Walters (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549753 ] 

Chad Walters commented on HADOOP-2075:
--------------------------------------

I like the idea of lots of splits early on, while the number of regions is less than the number of region servers. You want to make sure the splits are made at points that are relatively well distributed, of course, so don't set the threshold so small that you split without a representative sample of the keys. This would be a good general-purpose solution that doesn't create a new API. The bulk upload then simply looks like partitioning the dataset and uploading via MapReduce, perhaps with batched inserts. Do you really think this would be dog slow?

If that is not fast enough, I suppose we could have a MapFile uploader. This would require the dataset to be prepared properly, which could be a bit fiddly (sorting, splitting properly across columns, etc.).
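The "batched inserts" idea amounts to grouping rows so each round trip to the server carries many cells instead of one. A minimal sketch, with a stand-in client callback rather than the HBase API:

```python
def batched(rows, batch_size):
    """Yield successive batches of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def bulk_upload(client_put_many, rows, batch_size=100):
    """Upload rows in batches; returns the number of round trips made."""
    trips = 0
    for batch in batched(rows, batch_size):
        client_put_many(batch)  # one RPC per batch instead of one per row
        trips += 1
    return trips

sent = []
trips = bulk_upload(sent.append, list(range(250)), batch_size=100)
# 3 round trips carry all 250 rows.
```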



Re: [jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by Bryan Duxbury <br...@rapleaf.com>.
Well... someone should create a ticket for automatic merging behavior. It seems like the sort of thing you'd really want, to avoid fragmentation in tables with a lot of deletion.

On Dec 8, 2007, at 5:45 PM, Chad Walters (JIRA) wrote:



[jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by "Chad Walters (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549785 ] 

Chad Walters commented on HADOOP-2075:
--------------------------------------

Eventually that might be true, but merging is currently a manually triggered operation. Also, unless a more intelligent heuristic were in place, a small region would count against a whole region server until it was merged, which would slow down loading.



[jira] Commented: (HADOOP-2075) [hbase] Bulk load and dump tools

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549661 ] 

Bryan Duxbury commented on HADOOP-2075:
---------------------------------------

A really cool feature for bulk loading would be artificially lowering the split size so that splits occur often, at least until there are as many regions as there are region servers. That way, the load operation could get a lot more parallelism early on.
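The heuristic suggested above could be sketched as: use an aggressively small split threshold until every region server has a region, then fall back to the normal size. The threshold values below are made up for illustration; nothing here reflects actual HBase configuration.

```python
def split_threshold(num_regions, num_region_servers,
                    small=16 * 1024 * 1024, normal=256 * 1024 * 1024):
    """Shrink the region split size while the table is under-distributed,
    so early bulk-load traffic fans out across servers quickly."""
    return small if num_regions < num_region_servers else normal
```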
