Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2013/01/04 02:27:11 UTC

[jira] [Comment Edited] (HBASE-6768) HBase Rest server crashes if client tries to retrieve data size > 5 MB

    [ https://issues.apache.org/jira/browse/HBASE-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543499#comment-13543499 ] 

Andrew Purtell edited comment on HBASE-6768 at 1/4/13 1:25 AM:
---------------------------------------------------------------

bq. With many concurrent curl commands, the REST server will very likely OOM. In this case, I don't think there is much we can do, right?

Yeah, REST has limits. To avoid the cost of HTTP overheads when scanning or doing multigets, REST was designed to build the response -- issuing multiple RPCs to the HBase cluster to do so -- and then send it back to the client in one HTTP transaction. If a client request produces a really big response, it has to fit in heap on the REST gateway. REST scanners do their own batching to handle large datasets in chunks. For a given row request, however, we can't divide it up or request byte sub-ranges of values from the RegionServers. First the Result (a row) is assembled from the Get or Scan results inside the HBase client library. Then REST builds a model from the Result, which ends up copying all of the data, because REST predates KeyValue, so it has its own representation (Cell). Then the model is sent out by Jersey/Jetty. The Result becomes a candidate for GC as soon as the model is built; the model becomes a candidate for GC as soon as Jersey finishes the request.
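The scanner batching mentioned above can be driven from any HTTP client. Below is a minimal Python sketch of the stateful REST scanner flow, assuming a gateway at a hypothetical http://localhost:8080 and a table named table1; the batch attribute caps the number of cells returned per fetch, so no single response has to hold an entire large row in the gateway's heap.

```python
import base64
import urllib.request

def scanner_body(batch):
    """XML body creating a stateful scanner that returns at most
    `batch` cells per fetch."""
    return '<Scanner batch="{}"/>'.format(batch).encode("utf-8")

def decode_cell(b64_value):
    """Cell values in the REST gateway's JSON/XML responses are
    base64-encoded; decode one back to raw bytes."""
    return base64.b64decode(b64_value)

def scan_in_batches(gateway, table, batch=100):
    # PUT /<table>/scanner creates the scanner; the Location header
    # names it for subsequent GETs. A 204 from a GET means exhausted.
    # (gateway/table values here are illustrative assumptions.)
    req = urllib.request.Request(
        "{}/{}/scanner".format(gateway, table),
        data=scanner_body(batch),
        headers={"Content-Type": "text/xml"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        scanner_url = resp.headers["Location"]
    while True:
        fetch = urllib.request.Request(
            scanner_url, headers={"Accept": "application/json"})
        with urllib.request.urlopen(fetch) as resp:
            if resp.status == 204:   # scanner exhausted
                break
            yield resp.read()        # one batch of cells
```

This mirrors what REST scanners do server-side; a client that consumes batches as they arrive never forces the gateway to buffer more than one batch per connection.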

Edit: So if the REST URL is a request for an entire row, the data in the row must fit in heap (times ~2). If the REST URL is a request for a large cell (100s of MB), likewise. Multiply by the number of concurrent connections expected. As with HBase in general, storing large values should be avoided. Put big blobs in HDFS. As far as I know, the Thrift gateway operates similarly. For large rows or large cells, direct cluster access via the Java API is the best option.
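As a back-of-envelope illustration of the sizing math above (the ~2x copy factor and the straight multiplication by concurrency are rough assumptions, not measurements):

```python
def rest_heap_estimate(max_response_bytes, concurrent_requests, copy_factor=2):
    """Rough worst-case bytes held in gateway heap for in-flight
    responses: each response exists ~copy_factor times (Result plus
    model), once per concurrent request."""
    return max_response_bytes * copy_factor * concurrent_requests

# e.g. 100 MB cells with 8 concurrent readers needs on the order of
# 1.6 GB of headroom just for response buffers:
print(rest_heap_estimate(100 * 1024**2, 8))  # 1677721600
```

If that number approaches the gateway's configured heap, either the value sizes or the concurrency has to come down.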

There are probably some clever things we can do to reduce copying, especially if we also consider changing the client library at the same time, but to date this hasn't been urgent enough to try.
                
> HBase Rest server crashes if client tries to retrieve data size > 5 MB
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6768
>                 URL: https://issues.apache.org/jira/browse/HBASE-6768
>             Project: HBase
>          Issue Type: Bug
>          Components: REST
>    Affects Versions: 0.90.5
>            Reporter: Mubarak Seyed
>            Assignee: Jimmy Xiang
>              Labels: noob
>
> I have a CF with one qualifier whose data size is > 5 MB. When I try to read the raw binary data as an octet-stream using curl, the REST server crashes and curl throws an exception:
> {code}
>  curl -v -H "Accept: application/octet-stream" http://abcdefgh-hbase003.test1.test.com:9090/table1/row_key1/cf:qualifer1 > /tmp/out
> * About to connect() to abcdefgh-hbase003.test1.test.com port 9090
> *   Trying xx.xx.xx.xxx... connected
> * Connected to abcdefgh-hbase003.test1.test.com (xx.xxx.xx.xxx) port 9090
> > GET /table1/row_key1/cf:qualifer1 HTTP/1.1
> > User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> > Host: abcdefgh-hbase003.test1.test.com:9090
> > Accept: application/octet-stream
> > 
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0< HTTP/1.1 200 OK
> < Content-Length: 5129836
> < X-Timestamp: 1347338813129
> < Content-Type: application/octet-stream
>   0 5009k    0 16272    0     0   7460      0  0:11:27  0:00:02  0:11:25 13872transfer closed with 1148524 bytes remaining to read
>  77 5009k   77 3888k    0     0  1765k      0  0:00:02  0:00:02 --:--:-- 3253k* Closing connection #0
> curl: (18) transfer closed with 1148524 bytes remaining to read
> {code}
> I couldn't find an exception in the REST server log, and there was no core dump either. This issue is consistently reproducible. I also tried with the HBase REST client (RemoteHTable) and could recreate the issue when the data size is > 10 MB (even with the MIME_PROTOBUF accept header).
