Posted to issues@hbase.apache.org by "Istvan Toth (Jira)" <ji...@apache.org> on 2021/10/26 17:19:00 UTC

[jira] [Created] (HBASE-26398) CellCounter fails for large tables filling up local disk

Istvan Toth created HBASE-26398:
-----------------------------------

             Summary: CellCounter fails for large tables filling up local disk
                 Key: HBASE-26398
                 URL: https://issues.apache.org/jira/browse/HBASE-26398
             Project: HBase
          Issue Type: Bug
          Components: mapreduce
    Affects Versions: 3.0.0-alpha-2
            Reporter: Istvan Toth
            Assignee: Istvan Toth


CellCounter dumps all cell coordinates into its output, which can become huge.

The spill can fill the local disk on the reducer. 
CellCounter hardcodes *mapreduce.job.reduces* to *1*, so it is not possible to use multiple reducers to get around this.

This is easy to fix: stop hardcoding *mapreduce.job.reduces*. It still defaults to 1, but can then be overridden by the user.
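
A minimal sketch of the intended behaviour, using only the Java standard library: honour a user-supplied *mapreduce.job.reduces* if present, otherwise fall back to 1. The class name, method name and the use of java.util.Properties are illustrative assumptions; the real tool reads this setting from the Hadoop job configuration, and this is not the actual CellCounter code.

```java
import java.util.Properties;

public class ReducerCountSketch {
    // Property name taken from the issue description.
    static final String REDUCES_KEY = "mapreduce.job.reduces";

    // Hypothetical helper: return the user's reducer count if one was
    // supplied, otherwise the default of 1 (instead of always forcing 1).
    static int resolveReducers(Properties userConf) {
        String value = userConf.getProperty(REDUCES_KEY);
        if (value == null) {
            return 1; // default when the user did not override it
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(resolveReducers(conf)); // nothing set: default

        conf.setProperty(REDUCES_KEY, "8");        // user override
        System.out.println(resolveReducers(conf));
    }
}
```

The point of the sketch is only that the value is never overwritten unconditionally, so an override given by the user survives.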

CellCounter also generates two extra records with constant keys for each cell, which have to be processed by the reducer.
Even with multiple reducers, these records (1/3 of the total) all go to the same reducer, which can also fill up the disk.

This can be fixed by adding a Combiner to the Mapper, which sums the counter records, thereby reducing the Mapper output records to 1/3 of their previous amount.
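
The effect of a combiner can be illustrated with a small standard-library simulation. It assumes a simplified record model (one coordinate record plus two constant-key counter records per cell); the key names COUNTER_A and COUNTER_B and all class and method names are hypothetical, and the real mapper's record layout may differ. It does not use the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Hypothetical constant keys standing in for the counter records.
    static final String COUNTER_A = "counter-a";
    static final String COUNTER_B = "counter-b";

    // Simulated mapper: for each cell, emit one coordinate record and
    // two counter records that always use the same constant keys.
    static List<Map.Entry<String, Long>> mapOutput(List<String> cellCoordinates) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String coordinate : cellCoordinates) {
            out.add(new AbstractMap.SimpleEntry<>(coordinate, 1L));
            out.add(new AbstractMap.SimpleEntry<>(COUNTER_A, 1L));
            out.add(new AbstractMap.SimpleEntry<>(COUNTER_B, 1L));
        }
        return out;
    }

    // Simulated combiner: sum the values of records sharing a key, so each
    // constant key leaves the mapper as a single record.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> summed = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : records) {
            summed.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return summed;
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("r1;f;q1", "r2;f;q1", "r3;f;q2");
        List<Map.Entry<String, Long>> raw = mapOutput(cells);
        Map<String, Long> combined = combine(raw);
        System.out.println("records before combine: " + raw.size());
        System.out.println("records after combine:  " + combined.size());
    }
}
```

In this model the counter records collapse to one record per constant key per mapper, so the single reducer that receives those keys no longer has to spill one record per cell.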



--
This message was sent by Atlassian Jira
(v8.3.4#803005)