Posted to issues@hbase.apache.org by "Viraj Jasani (Jira)" <ji...@apache.org> on 2023/06/02 19:21:00 UTC
[jira] [Commented] (HBASE-27904) A random data generator tool leveraging bulk load.

    [ https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728850#comment-17728850 ] 

Viraj Jasani commented on HBASE-27904:
--------------------------------------

Thanks for filing the Jira [~hgwalani81]!

I've assigned it to you, and you now have contributor access, so Jira issues can be assigned to you going forward. Thanks again.

> A random data generator tool leveraging bulk load.
> --------------------------------------------------
>
>                 Key: HBASE-27904
>                 URL: https://issues.apache.org/jira/browse/HBASE-27904
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Himanshu Gwalani
>            Assignee: Himanshu Gwalani
>            Priority: Minor
>
> As of now, there is no data generator tool in HBase that leverages bulk load. Since bulk load skips the client write path, it is much faster to generate data this way and use it for load/performance tests where client writes are not mandatory.
> {*}Example{*}: Any tooling over HBase that needs x TB of HBase table data for load testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any pre-generated data (such as CSV/XML files) as input.
> 2. Tooling should support pre-split tables (number of splits to be taken as input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split, with the number of splits taken as input).
> 2. The mapper of a custom MapReduce job will generate random key-value pairs and ensure that they are equally distributed across all regions of the table.
> 3. [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java] will be used to add the reducer to the MR job and create HFiles from the key-value pairs generated by the mapper.
> 4. The HFiles will be bulk loaded into the respective regions of the table using [LoadIncrementalHFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html].
> *Results*
> We built a POC of this tool in our organization and tested it on an 11-node HBase cluster (with HBase and Hadoop services running). The tool generated:
> 1. *100 GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
>  
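
For illustration only, the uniform-distribution requirement in the quoted description (random row keys spread evenly over a pre-split table) might be sketched as below. This is not the POC's code: the class and method names are hypothetical, the 4-byte key prefix is an assumed design, and only the JDK is used so the sketch runs without an HBase cluster. It assumes split points are compared against row keys as unsigned bytes, with each split point being the inclusive start of the region to its right.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of HBase: evenly spaced split points plus random row keys.
final class UniformKeySketch {
    // Assumption: the first 4 bytes of every row key are treated as an unsigned 32-bit prefix.
    private static final long KEY_SPACE = 1L << 32;

    /** Returns numRegions - 1 split points that cut the prefix space into equal ranges. */
    static byte[][] splitPoints(int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be >= 1");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = i * KEY_SPACE / numRegions;
            // The int cast keeps the low 32 bits; ByteBuffer writes them big-endian.
            splits[i - 1] = ByteBuffer.allocate(4).putInt((int) boundary).array();
        }
        return splits;
    }

    /** Generates a row key made of a random 4-byte prefix and suffixLength random bytes. */
    static byte[] randomRowKey(Random random, int suffixLength) {
        byte[] key = new byte[4 + suffixLength];
        random.nextBytes(key);
        return key;
    }

    /** Returns the index of the region whose key range contains rowKey. */
    static int regionIndex(byte[] rowKey, byte[][] splits) {
        long keyPrefix = unsignedPrefix(rowKey);
        int index = 0;
        // A split point is the inclusive start key of the region to its right.
        while (index < splits.length && keyPrefix >= unsignedPrefix(splits[index])) {
            index++;
        }
        return index;
    }

    private static long unsignedPrefix(byte[] bytes) {
        return ByteBuffer.wrap(bytes, 0, 4).getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int numRegions = 8;
        byte[][] splits = splitPoints(numRegions);
        int[] keysPerRegion = new int[numRegions];
        Random random = new Random();
        for (int i = 0; i < 100_000; i++) {
            keysPerRegion[regionIndex(randomRowKey(random, 12), splits)]++;
        }
        // Counts should be roughly equal across regions.
        System.out.println(Arrays.toString(keysPerRegion));
    }
}
```

In a real job, split points like these would be passed to table creation, and a mapper would emit the random keys; because the key prefix is uniform over the whole key space, each region should receive about the same share of the generated rows.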



--
This message was sent by Atlassian Jira
(v8.20.10#820010)