Posted to dev@accumulo.apache.org by Bill Havanki <bh...@clouderagovt.com> on 2014/06/06 22:31:52 UTC

Re: Supporting large values

Just to close this email thread out:

I found that the scanner in the test reader client was using the default
batch size of 1000 (or is it 10,000? I don't remember exactly) and
requesting an entire split from the table. I reduced the batch size to 1,
and I think that was the key to fixing the memory issue and getting the
test to complete.
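
For anyone who hits this later, the change amounted to something like the
snippet below. (The class and variable names are placeholders from our test
harness, not anything in the stress tool itself.)

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class LowBatchRead {
      // conn and tableName come from the surrounding test setup.
      static void scanLargeValues(Connector conn, String tableName)
          throws TableNotFoundException {
        Scanner scanner = conn.createScanner(tableName, Authorizations.EMPTY);
        // The default batch size asks the tserver for many key-value pairs
        // per RPC; with ~100 MB values that is far too much to buffer at
        // once on either side.
        scanner.setBatchSize(1); // one key-value pair per round trip
        for (Entry<Key,Value> entry : scanner) {
          // each Value here can be ~100 MB; process it and drop the reference
        }
      }
    }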

Thank you so much to everyone who took time to think about this.

Bill


On Wed, May 28, 2014 at 11:15 AM, Bill Havanki <bh...@clouderagovt.com>
wrote:

> The immediate intent is to run some memory stress tests as part of a
> deliverable of ours for Accumulo. So, right now I'm just trying to get the
> tests to pass. A greater goal is indeed to understand what's needed to
> support really large keys or values. I don't think we're looking to
> generate a formula yet, but maybe just advice on what configuration
> settings and such to look out for in general.
>
> I did discover that the test client was a) requesting one split at a time
> and b) not setting a low scanner batch size (the default is 1000). Setting
> the batch size down to 1 seems to have helped a lot, so things are slowly
> improving. :)
>
>
> On Wed, May 28, 2014 at 10:24 AM, Josh Elser <jo...@gmail.com> wrote:
>
>> On 5/28/14, 9:39 AM, Bill Havanki wrote:
>>
>>> Thanks Josh!
>>>
>>> - This is indeed under CDH 4.6.0. If there is a particular line number
>>> you want to see code for, just name it and I'll look it up.
>>>
>>
>> I was generally curious to see what kind of batching the DFSOutputStream
>> does (it looked like it was checksumming small chunks of data), but I can
>> look into that some more to satisfy my curiosity.
>>
>>
>>> - Re #2, the test client is sending mutations of only one cell each, so
>>> a mutation should be 100 MB + a little, due to the large value. It's
>>> inefficient, but it seems to be a good idea just for getting this test
>>> to survive. Maybe the logger code is hanging on to mutations in memory
>>> before writing them out? (That would surprise me, but I dunno.)
>>>
>>
>> Well, I think you're going to have to be able to keep "about" two copies
>> in memory (what I was trying to get at before). The tserver is going to
>> get the Mutation objects from the client, so that's one instance of, say,
>> 100MB. Before that write finishes, you'll also need to write those
>> Mutations out to the WAL, which means serializing each one using the
>> Writable methods. While that isn't quite the same as having a second
>> discrete object of that size on the heap, the bytes you write to the
>> DataOutput are still buffered through JVM heap.
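>>
>> To make that concrete, here is a rough sketch of what one of those writes
>> looks like end to end (class name, sizes, and the output path are all
>> illustrative, not taken from the stress tool or the tserver code):
>>
>>     import java.io.DataOutputStream;
>>     import java.io.FileOutputStream;
>>     import java.io.IOException;
>>     import java.util.Random;
>>
>>     import org.apache.accumulo.core.data.Mutation;
>>     import org.apache.accumulo.core.data.Value;
>>     import org.apache.hadoop.io.Text;
>>
>>     public class BigMutationSketch {
>>       public static void main(String[] args) throws IOException {
>>         byte[] big = new byte[100 * 1024 * 1024]; // ~100 MB value
>>         new Random().nextBytes(big);
>>
>>         // Copy #1: the Mutation buffers the serialized cell in memory.
>>         Mutation m = new Mutation(new Text("row1"));
>>         m.put(new Text("cf"), new Text("cq"), new Value(big));
>>
>>         // Copy #2, roughly: writing the Mutation out again (as the WAL
>>         // does via the Writable methods) pushes all of those bytes
>>         // through a DataOutput, buffered on the JVM heap, before they
>>         // ever reach HDFS.
>>         try (DataOutputStream out = new DataOutputStream(
>>             new FileOutputStream("/tmp/not-a-real-walog"))) {
>>           m.write(out);
>>         }
>>       }
>>     }
>>
>> So even with single-cell mutations, each 100MB value passes through heap
>> roughly twice for a short window.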
>>
>>
>>> Another fact I didn't mention is that I am running 2 writers and 2
>>> readers for the test. Perhaps 612 and 613 are the write threads, and
>>> then 615 is one scan, which might leave 614 as the remains of the other
>>> scan, which has already failed and is logging an OOME (which is what the
>>> monitor shows)?
>>>
>>
>> Perhaps! That might make sense.
>>
>>
>>> My thought from looking at this again is that Thrift is running out of
>>> space forming the scan result message as it fills up a
>>> ByteArrayOutputStream. Maybe there is some way to force Thrift to break
>>> things up?
>>>
>>
>> I don't know of anything inside of thrift that we could use to do that.
>>
>> Overall, though, what's your intent by testing this? Is it to have a
>> better understanding of server-side memory usage? Generally speaking, if
>> you have clients getting back 100MB values and the server is writing 100MB
>> values, that would intuitively use up a bit of heap space.
>>
>> I could see merit in constructing a general formula for memory
>> consumption based on avg key-value size, number of threads available to
>> read, number of threads available to write, and number of MinC/MajC
>> threads. It probably wouldn't be much more valuable than a starting point
>> due to variance, but it would be a starting point!
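>>
>> Off the top of my head, a very rough version of that formula might look
>> something like this (every term here is a guess, and it ignores GC
>> headroom, Thrift serialization buffers, and per-thread overhead):
>>
>>     tserver heap needed ~= (write threads x 2 x max mutation size)
>>                          + (scan threads x table.scan.max.memory)
>>                          + tserver.cache.data.size
>>                          + tserver.cache.index.size
>>                          + working space for minc/majc
>>
>> Plugging in your numbers (2 writers of ~100MB mutations, 2 readers with
>> table.scan.max.memory at 256M, 256M data cache, 40M index cache), that's
>> already around 1.2GB before any of the other overhead - a big slice of a
>> 4GB heap.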
>>
>>
>>> Thanks for burning cycles on this.
>>>
>>> Bill
>>>
>>>
>>> On Tue, May 27, 2014 at 7:11 PM, Josh Elser <jo...@gmail.com>
>>> wrote:
>>>
>>>> Well, for this one, it looks to me that you have two threads writing
>>>> data (ClientPool 612 and 613), with 612 being blocked by 613. There are
>>>> two threads reading data, but they both appear to be in nativemap code,
>>>> so I don't expect too much memory usage from them. ClientPool 615 is
>>>> the thrift call for one of those scans. I'm not quite sure what
>>>> ClientPool 614 is doing.
>>>>
>>>> My hunch is that 613 is what actually pushed you into the OOME. I can't
>>>> really say much more because I assume you're running on CDH; the line
>>>> numbers don't match up to the Hadoop sources I have locally.
>>>>
>>>> I don't think there's much inside the logger code that will hold onto
>>>> duplicate mutations, so the two things I'm curious about are:
>>>>
>>>> 1. Any chunking/buffering done inside of the DFSOutputStream (and if we
>>>> should be using/configuring something differently). I see some signs of
>>>> this from the method names in the stack trace.
>>>>
>>>> 2. Figuring out a formula for sizes of Mutations that are directly (via
>>>> (Server)Mutation objects on heap) or indirectly (being written out to
>>>> some OutputStream, like the DFSOutputStream previously mentioned),
>>>> relative to the Accumulo configuration.
>>>>
>>>> I imagine #2 is where we could gain the most value.
>>>>
>>>> Hopefully that brain dump is helpful :)
>>>>
>>>>
>>>> On 5/27/14, 6:19 PM, Bill Havanki wrote:
>>>>
>>>>> Stack traces are here:
>>>>>
>>>>> https://gist.github.com/kbzod/e6e21ea15cf5670ba534
>>>>>
>>>>> This time something showed up in the monitor; often there is no stack
>>>>> trace there. The thread dump is from setting ACCUMULO_KILL_CMD to
>>>>> "kill -3 %p".
>>>>>
>>>>> Thanks again
>>>>> Bill
>>>>>
>>>>>
>>>>> On Tue, May 27, 2014 at 5:09 PM, Bill Havanki
>>>>> <bhavanki@clouderagovt.com> wrote:
>>>>>
>>>>>> I left the default key size constraint in place. I had set the tserver
>>>>>> message size up from 1 GB to 1.5 GB, but it didn't help. (I forgot
>>>>>> that config item.)
>>>>>>
>>>>>> Stack trace(s) coming up! I got tired of failures all day so I'm
>>>>>> running a different test that will hopefully work. I'll re-break it
>>>>>> shortly :D
>>>>>>
>>>>>>
>>>>>> On Tue, May 27, 2014 at 5:04 PM, Josh Elser <jo...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Stack traces would definitely be helpful, IMO.
>>>>>>>
>>>>>>> (or interesting if nothing else :D)
>>>>>>>
>>>>>>>
>>>>>>> On 5/27/14, 4:55 PM, Bill Havanki wrote:
>>>>>>>
>>>>>>>> No sir. I am seeing general out of heap space messages, nothing
>>>>>>>> about direct buffers. One specific example would be while Thrift is
>>>>>>>> writing to a ByteArrayOutputStream to send off scan results. (I can
>>>>>>>> get an exact stack trace - easily :} - if it would be helpful.) It
>>>>>>>> seems as if there just isn't enough heap left, after controlling for
>>>>>>>> what I have so far.
>>>>>>>>
>>>>>>>> As a clarification of my original email: each row has 100 cells,
>>>>>>>> and each cell has a 100 MB value. So, one row would occupy just
>>>>>>>> over 10 GB.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 27, 2014 at 4:49 PM, <dl...@comcast.net> wrote:
>>>>>>>>
>>>>>>>>> Are you seeing something similar to the error in
>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-2495?
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>> From: "Bill Havanki" <bh...@clouderagovt.com>
>>>>>>>>> To: "Accumulo Dev List" <de...@accumulo.apache.org>
>>>>>>>>> Sent: Tuesday, May 27, 2014 4:30:59 PM
>>>>>>>>> Subject: Supporting large values
>>>>>>>>>
>>>>>>>>> I'm trying to run a stress test where each row in a table has 100
>>>>>>>>> cells, each with a value of 100 MB of random data. (This is using
>>>>>>>>> Bill Slacum's memory stress test tool). Despite fiddling with the
>>>>>>>>> cluster configuration, I always run out of tablet server heap space
>>>>>>>>> before too long.
>>>>>>>>>
>>>>>>>>> Here are the configurations I've tried so far, with valuable
>>>>>>>>> guidance from Busbey and madrob:
>>>>>>>>>
>>>>>>>>> - native maps are enabled, tserver.memory.maps.max = 8G
>>>>>>>>> - table.compaction.minor.logs.threshold = 8
>>>>>>>>> - tserver.walog.max.size = 1G
>>>>>>>>> - Tablet server has 4G heap (-Xmx4g)
>>>>>>>>> - table is pre-split into 8 tablets (split points 0x20, 0x40,
>>>>>>>>>   0x60, ...); 5 tablet servers are available
>>>>>>>>> - tserver.cache.data.size = 256M
>>>>>>>>> - tserver.cache.index.size = 40M (keys are small - 4 bytes - in
>>>>>>>>>   this test)
>>>>>>>>> - table.scan.max.memory = 256M
>>>>>>>>> - tserver.readahead.concurrent.max = 4 (default is 16)
>>>>>>>>>
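>>>>>>>>> For reference, the table-level pieces above are applied with roughly
>>>>>>>>> the following (connector and table name come from the test harness;
>>>>>>>>> the tserver.* and native map settings live in accumulo-site.xml):
>>>>>>>>>
>>>>>>>>>     import java.util.TreeSet;
>>>>>>>>>
>>>>>>>>>     import org.apache.accumulo.core.client.Connector;
>>>>>>>>>     import org.apache.accumulo.core.client.admin.TableOperations;
>>>>>>>>>     import org.apache.hadoop.io.Text;
>>>>>>>>>
>>>>>>>>>     public class StressTableSetup {
>>>>>>>>>       static void setUp(Connector conn, String table) throws Exception {
>>>>>>>>>         TableOperations ops = conn.tableOperations();
>>>>>>>>>         ops.create(table);
>>>>>>>>>
>>>>>>>>>         // pre-split into 8 tablets: split points 0x20, 0x40, ..., 0xE0
>>>>>>>>>         TreeSet<Text> splits = new TreeSet<Text>();
>>>>>>>>>         for (int b = 0x20; b <= 0xE0; b += 0x20) {
>>>>>>>>>           splits.add(new Text(new byte[] {(byte) b}));
>>>>>>>>>         }
>>>>>>>>>         ops.addSplits(table, splits);
>>>>>>>>>
>>>>>>>>>         ops.setProperty(table, "table.compaction.minor.logs.threshold", "8");
>>>>>>>>>         ops.setProperty(table, "table.scan.max.memory", "256M");
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>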
>>>>>>>>> It's often hard to tell where the OOM error comes from, but I have
>>>>>>>>> seen it frequently coming from Thrift as it is writing out scan
>>>>>>>>> results.
>>>>>>>>>
>>>>>>>>> Does anyone have any good conventions for supporting large values?
>>>>>>>>> (Warning: I'll want to work on large keys (and tiny values)
>>>>>>>>> next! :) )
>>>>>>>>>
>>>>>>>>> Thanks very much
>>>>>>>>> Bill
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> // Bill Havanki
>>>>>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>>>>>> // 443.686.9283
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> // Bill Havanki
>>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>>> // 443.686.9283
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
> --
> // Bill Havanki
> // Solutions Architect, Cloudera Govt Solutions
> // 443.686.9283
>



-- 
// Bill Havanki
// Solutions Architect, Cloudera Govt Solutions
// 443.686.9283