Posted to issues@hbase.apache.org by "Istvan Toth (Jira)" <ji...@apache.org> on 2021/10/26 17:19:00 UTC

[jira] [Updated] (HBASE-26398) CellCounter fails for large tables filling up local disk

     [ https://issues.apache.org/jira/browse/HBASE-26398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Istvan Toth updated HBASE-26398:
--------------------------------
    Priority: Minor  (was: Major)

> CellCounter fails for large tables filling up local disk
> --------------------------------------------------------
>
>                 Key: HBASE-26398
>                 URL: https://issues.apache.org/jira/browse/HBASE-26398
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: Istvan Toth
>            Assignee: Istvan Toth
>            Priority: Minor
>
> CellCounter dumps all cell coordinates into its output, which can become huge.
> The spill can fill the local disk on the reducer. 
> CellCounter hardcodes *mapreduce.job.reduces* to *1*, so it is not possible to use multiple reducers to get around this.
> Fixing this is easy: if *mapreduce.job.reduces* is not hardcoded, it still defaults to 1 but can be overridden by the user.
> CellCounter also generates two extra records with constant keys for each cell, which have to be processed by the reducer.
> Even with multiple reducers, these (1/3 of the total records) will all go to the same reducer, which can also fill up the disk.
> This can be fixed by adding a Combiner to the Mapper, which sums the counter records, thereby reducing the Mapper output records to 1/3 of their previous amount.
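The effect of the proposed Combiner can be sketched in plain Java, without any Hadoop or HBase dependency. This is only an illustration of the idea, not the CellCounter code or the actual patch: the class name, the two constant key strings, and the coordinate format are all hypothetical, and the sketch assumes each cell coordinate is unique within a map task.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation only; names and key strings are assumptions,
// not taken from the real CellCounter implementation.
class CombinerSketch {
    // Hypothetical stand-ins for the two constant counter keys.
    static final String TOTAL_A = "total-counter-a";
    static final String TOTAL_B = "total-counter-b";

    /** Simulated mapper: one coordinate record plus two constant-key records per cell. */
    static List<Map.Entry<String, Long>> mapCells(List<String> cellCoordinates) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String coord : cellCoordinates) {
            out.add(Map.entry(coord, 1L));
            out.add(Map.entry(TOTAL_A, 1L));
            out.add(Map.entry(TOTAL_B, 1L));
        }
        return out;
    }

    /** Simulated combiner: sums the values per key before records leave the map side. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : records) {
            sums.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> cells = List.of("row1;cf:a", "row1;cf:b", "row2;cf:a", "row2;cf:b");
        List<Map.Entry<String, Long>> raw = mapCells(cells);
        Map<String, Long> combined = combine(raw);
        System.out.println("records before combine: " + raw.size());
        System.out.println("records after combine:  " + combined.size());
    }
}
```

Under these assumptions, a map task that sees N cells emits 3N records without combining and N + 2 records with it, which approaches the 1/3 figure above as N grows. In a real job the same summing logic would live in a Reducer subclass registered as the job's combiner.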



--
This message was sent by Atlassian Jira
(v8.3.4#803005)