Posted to issues@hbase.apache.org by "Ted Malaska (JIRA)" <ji...@apache.org> on 2015/07/31 21:20:05 UTC

[jira] [Commented] (HBASE-14150) Add BulkLoad functionality to HBase-Spark Module

    [ https://issues.apache.org/jira/browse/HBASE-14150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649676#comment-14649676 ] 

Ted Malaska commented on HBASE-14150:
-------------------------------------

Just to update: I started this yesterday and hope to have a patch by end of day Monday.

The code is done, but I haven't started on the unit tests; I expect those to take a fair amount of time.

> Add BulkLoad functionality to HBase-Spark Module
> ------------------------------------------------
>
>                 Key: HBASE-14150
>                 URL: https://issues.apache.org/jira/browse/HBASE-14150
>             Project: HBase
>          Issue Type: New Feature
>          Components: spark
>            Reporter: Ted Malaska
>            Assignee: Ted Malaska
>
> Add on to the work done in HBASE-13992 to add functionality to do a bulk load from a given RDD.
> This will do the following:
> 1. Figure out the number of regions, then sort and partition the data correctly so it can be written out as HFiles.
> 2. Unlike the MR bulk load, sort the columns in the shuffle stage rather than in the memory of the reducer. This allows the design to support super-wide records without running out of memory.
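The second point above can be sketched in a few lines of Scala. This is an illustration only, not the actual HBase-Spark API: the names `KeyFamilyQualifier` and the plain `List` standing in for an RDD are assumptions. The idea is that if the shuffle key is the composite (row key, column family, qualifier), then `repartitionAndSortWithinPartitions` orders individual cells during the shuffle, and the HFile writer can stream cells in order without ever buffering a whole wide row in reducer memory:

```scala
// Composite shuffle key (illustrative name): cells compare by row key first,
// then column family, then qualifier -- the sort order HFiles require.
case class KeyFamilyQualifier(rowKey: String, family: String, qualifier: String)

implicit val kfqOrdering: Ordering[KeyFamilyQualifier] =
  Ordering.by(k => (k.rowKey, k.family, k.qualifier))

// In a real job these pairs would live in an RDD; a plain List stands in here.
val cells = List(
  KeyFamilyQualifier("row2", "cf", "b") -> "v3",
  KeyFamilyQualifier("row1", "cf", "b") -> "v2",
  KeyFamilyQualifier("row1", "cf", "a") -> "v1"
)

// repartitionAndSortWithinPartitions would perform this per-partition sort
// inside the shuffle; with qualifiers in the key, each cell is its own
// shuffle record, so no single record grows with row width.
val sorted = cells.sortBy(_._1)
sorted.foreach { case (k, v) =>
  println(s"${k.rowKey}/${k.family}:${k.qualifier} -> $v")
}
```

Because each cell travels through the shuffle as its own record, a row with millions of columns costs no more per-record memory than a narrow one, which is what makes the "super-wide records" goal feasible.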



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)