Posted to issues@hbase.apache.org by "Jean-Marc Spaggiari (JIRA)" <ji...@apache.org> on 2016/08/17 00:40:20 UTC

[jira] [Commented] (HBASE-16425) [Operability] Autohandling 'bad data'

    [ https://issues.apache.org/jira/browse/HBASE-16425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423675#comment-15423675 ] 

Jean-Marc Spaggiari commented on HBASE-16425:
---------------------------------------------

I like this thread!

Another thing related to bulk load: if someone bulkloads a cell which is WAY too big, say a 2GB cell, the region server might not be able to load it and will fail. Might be nice to detect that and alert the user, log the issue, or skip the cell...
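As a rough illustration of that detection idea, here is a minimal sketch in plain Java of a pre-load size check. The class, method names, and 10MB threshold are all hypothetical for illustration; they are not actual HBase APIs, and a real check would use a configurable limit.

```java
// Hypothetical sketch of a pre-bulkload cell size check.
// CellSizeCheck, isLoadable, and MAX_CELL_BYTES are illustrative
// names, not real HBase APIs.
public final class CellSizeCheck {

    // Assumed limit for this sketch; a real deployment would make
    // this configurable rather than hard-coding 10 MB.
    static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    /** Returns true if the cell value is small enough to load. */
    static boolean isLoadable(byte[] value) {
        return value != null && value.length <= MAX_CELL_BYTES;
    }

    public static void main(String[] args) {
        byte[] small = new byte[1024];
        // A loader could log and skip any cell that fails this
        // check instead of letting the region server fall over.
        System.out.println(isLoadable(small)); // prints "true"
    }
}
```

The point of the sketch is only that the check is cheap and can happen before the region server ever sees the oversized cell, so the failure mode becomes a log line rather than an OOME.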

> [Operability] Autohandling 'bad data'
> -------------------------------------
>
>                 Key: HBASE-16425
>                 URL: https://issues.apache.org/jira/browse/HBASE-16425
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: Operability
>            Reporter: stack
>
> This is a brainstorming issue. It came up chatting w/ a couple of operators talking about 'bad data'; i.e. no matter how you control your clients, someone by mistake or under a misconception will load an out-of-spec Cell or Row. In this particular case, two types of 'bad data' were talked about:
> (on) The Big Cell: An upload of a 'big cell' came in via bulkload, and it so happened that their frontend clients all requested the oversized Cell at the same time, so hundreds of threads were fetching the big cell at once. The RS OOME'd. Then when the region opened on the new RS, it OOME'd too, and so on. Could we switch to chunking when a Server sees that it has a large Cell on its hands? I suppose bulk load could defeat any Put chunking we had in place, but it would be good to have this too. Chatting w/ Matteo, we probably want to just move to the streaming Interface that we've talked of in the past at various times; the Get would chunk out the big Cell for assembly on the Client, or just give back the Cell in pieces -- an OutputStream for the Application to suck on. New API and/or old API could use it when Cells are big.
> (on) The user had a row with 29M Columns in it because the default entity had id=-1.... In this case chunking the Scan (v1.1+) helps but the operator was having trouble finding the problem row. How could we surface anomalies like this for operators? On flush, add even more meta data to the HFile (Yahoo! Data Sketches as [~jleach] has been suggesting) and then an offline tool to read metadata and run it through a few simple rules. Data Sketches are mergeable so could build up a region-view or store-view....
> This is sketchy and I'm pretty sure it repeats stuff in old issues, but parking this note here while the encounter is still fresh.
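The chunked-Get idea in the quoted description can be sketched in a few lines of plain Java: the server side splits a big cell value into fixed-size pieces, and the client side reassembles them. The class and method names here are invented for illustration and are not an actual HBase interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of chunking a big cell value for transfer
// and reassembly on the client. Not a real HBase API.
public final class CellChunker {

    /** Server side: split a value into pieces of at most chunkSize bytes. */
    static List<byte[]> chunk(byte[] value, int chunkSize) {
        List<byte[]> pieces = new ArrayList<>();
        for (int off = 0; off < value.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, value.length);
            pieces.add(Arrays.copyOfRange(value, off, end));
        }
        return pieces;
    }

    /** Client side: reassemble the pieces into the original value. */
    static byte[] assemble(List<byte[]> pieces) {
        int total = pieces.stream().mapToInt(p -> p.length).sum();
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] p : pieces) {
            System.arraycopy(p, 0, out, off, p.length);
            off += p.length;
        }
        return out;
    }

    public static void main(String[] args) {
        // A 10-byte value in 4-byte chunks yields 3 pieces.
        System.out.println(chunk(new byte[10], 4).size()); // prints "3"
    }
}
```

With something like this, no single allocation on the server or the client ever has to hold the whole 2GB cell at once; a streaming variant would hand the pieces to the application through an OutputStream/InputStream pair instead of a List.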



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)