Posted to dev@accumulo.apache.org by Joe Stein <jo...@stealth.ly> on 2014/08/18 17:36:53 UTC

Column size limit

Hi, for Accumulo is there a recommended max for column value size? So if we
want to store files, at what point do we have to split a file into parts, or
instead just store it in HDFS with a reference path to it?

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/

Re: Column size limit

Posted by Joe Stein <jo...@stealth.ly>.
Thanks! &&  Thanks!


Re: Column size limit

Posted by Josh Elser <jo...@gmail.com>.
I think Billie's project is one of our "examples" -- 
http://accumulo.apache.org/1.6/examples/dirlist.html

Re: Column size limit

Posted by Adam Fuchs <sc...@gmail.com>.
Joe,

I would say that the rule of thumb is tens of megabytes for a single
cell. There are two limits that affect this:

1) Amount of memory used: This includes ingesting into the BatchWriter,
buffering in the in-memory maps, scanning RFiles, and preparing query
responses. At any given point there can be a few copies of a cell hanging
out in memory, so you don't want to pack things too tightly. If you have
ridiculous amounts of memory, you can squeeze in some pretty large docs.
2) Message size for client/server communication: This is limited to 1G by
default but can be increased if needed (see the config sketch below). A
single key/value pair will not be fragmented across these message frames.
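
For example, on a 1.6-era install the server-side limit lives in
accumulo-site.xml. This is a minimal sketch, assuming the 1.6-era property
name tserver.server.message.size.max; verify it against the configuration
docs for your version before relying on it:

  <!-- accumulo-site.xml: raise the max Thrift message size above the 1G
       default. Property name assumed from 1.6-era docs. -->
  <property>
    <name>tserver.server.message.size.max</name>
    <value>2G</value>
  </property>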

Whether to store bigger files as fragmented cells or as references to HDFS
files typically comes down to security and lifecycle management. If you want
cell-level security and encryption protection, you'll probably want to go
with a fragmented key/value approach (a sketch follows below). If you want
to keep all of your data in one spot for easier management, you might also
prefer to fragment the files in Accumulo. Otherwise, sticking the file in
HDFS and storing a reference to it is a simple and solid solution.
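
To make the fragmented approach concrete, here is a minimal sketch against
the 1.6-era Java client API. The table name, row layout, and the 1 MB chunk
size are placeholder assumptions, and error handling is omitted; the dirlist
example linked above is the more complete treatment.

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.ColumnVisibility;

  public class ChunkedFileWriter {
    // Keep each cell well under the tens-of-megabytes rule of thumb.
    private static final int CHUNK_SIZE = 1 << 20; // 1 MB

    // Stores one file as a row of chunk cells:
    //   row = fileId, family = "chunk", qualifier = zero-padded chunk index.
    public static void writeFile(Connector conn, String table, String fileId,
        byte[] data, ColumnVisibility vis) throws Exception {
      // Cap the BatchWriter buffer so big cells don't pile up in client
      // memory (limit 1 above).
      BatchWriterConfig cfg = new BatchWriterConfig().setMaxMemory(64L << 20);
      BatchWriter writer = conn.createBatchWriter(table, cfg);
      try {
        int numChunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        for (int i = 0; i < numChunks; i++) {
          int offset = i * CHUNK_SIZE;
          int len = Math.min(CHUNK_SIZE, data.length - offset);
          byte[] chunk = new byte[len];
          System.arraycopy(data, offset, chunk, 0, len);
          Mutation m = new Mutation(fileId);
          // Zero-padded qualifiers sort lexicographically, so one scan over
          // the row returns the chunks in order for reassembly.
          m.put("chunk", String.format("%08d", i), vis, new Value(chunk));
          writer.addMutation(m);
        }
      } finally {
        writer.close();
      }
    }
  }

Reading the file back is then a single Scanner over the row, concatenating
the values in qualifier order. The HDFS-reference alternative is just one
small cell whose value is the path (e.g. an hdfs:// URI), with the file
itself written through the normal HDFS client.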

Billie did a project a while ago to fragment and store larger files in
Accumulo. I'm not sure what happened with that, but it might be out there
somewhere for you to use.

Cheers,
Adam


