Posted to user@accumulo.apache.org by Frank Smith <fr...@outlook.com> on 2013/06/09 22:37:45 UTC

Best practices in sizing values?

I have an application where I have a block of unstructured text.  Normally that text is relatively small (under 500 KB), but there are conditions where it can run to several GB of text.
I was considering using a threshold at which I switch from storing the text in the value of my mutation to storing only a reference to its HDFS location, but I wanted to get some advice on where that threshold should (best practice) or must (system limitation) be.
Also, can I stream data into a value instead of passing a byte array, similar to how CLOBs and BLOBs are handled in an RDBMS?
Thanks,
Frank

Re: Best practices in sizing values?

Posted by Christopher <ct...@apache.org>.
The HDFS namenode may have problems with many small files, but it
depends on how many you're talking about. If the quantity becomes
problematic for HDFS, you could consider a chunking strategy: break up
files larger than your threshold and store the chunks in Accumulo. (In
response to your first post: Accumulo does not have the ability to
stream content into values, but chunking could achieve a similar
result; see the sketch below.)
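
For illustration, here is a rough sketch of that chunking approach; the
"docs" table name, the 512 KB chunk size, and the Connector passed in are
placeholders for whatever your application uses, not anything Accumulo
prescribes. It reads from an InputStream, so the whole file never has to
sit in memory:

import java.io.InputStream;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class ChunkedWriter {
  static final int CHUNK_SIZE = 512 * 1024; // illustrative chunk size

  // Store one document as many key-value pairs under a single row:
  //   <docId> data:<zero-padded chunk index> -> <chunk bytes>
  public static void write(Connector conn, String docId, InputStream in)
      throws Exception {
    BatchWriter bw = conn.createBatchWriter("docs", new BatchWriterConfig());
    byte[] buf = new byte[CHUNK_SIZE];
    int chunk = 0;
    int n;
    while ((n = in.read(buf)) > 0) {
      byte[] exact = new byte[n];            // last chunk is usually short
      System.arraycopy(buf, 0, exact, 0, n);
      Mutation m = new Mutation(docId);
      // zero-padded qualifier keeps chunks sorted in write order
      m.put("data", String.format("%07d", chunk++), new Value(exact));
      bw.addMutation(m);
    }
    bw.close();
  }
}

Each chunk is its own mutation, so client memory stays bounded by the
chunk size rather than the file size.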

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sun, Jun 9, 2013 at 8:21 PM, Frank Smith <fr...@outlook.com> wrote:
> So, what are your thoughts on storing a bunch of small files on the HDFS?
> Sequence Files, Avro?
>
> I will note that these are essentially write once and read heavy chunks of
> text.
>
>> Date: Sun, 9 Jun 2013 17:08:42 -0400
>> Subject: Re: Best practices in sizing values?
>> From: ctubbsii@apache.org
>> To: user@accumulo.apache.org
>
>>
>> At the very least, I would keep it under the size of your compressed
>> data blocks in your RFiles (this may mean you should increase value of
>> table.file.compress.blocksize to be larger than the default of 100K).
>>
>> You could also tweak this according to your application. Say, for
>> example, you wanted to limit the additional work to resolve the
>> pointer and retrieve from HDFS only 5% of the time, you could sample
>> your data, and choose a cutoff value that keeps 95% of your data in
>> the Accumulo table.
>>
>> Personally, I like to keep things under 1MB in the value, and under 1K
>> in the key, as a crude rule of thumb, but it very much depends on the
>> application.
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith <fr...@outlook.com>
>> wrote:
>> > I have an application where I have a block of unstructured text.
>> > Normally
>> > that text is relatively small <500k, but there are conditions where it
>> > can
>> > be up to GBs of text.
>> >
>> > I was considering of using a threshold where I simply decide to change
>> > from
>> > storing the text in the value of my mutation, and just add a reference
>> > to
>> > the HDFS location, but I wanted to get some advice on where that
>> > threshold
>> > should (best practice) or must (system limitation) be?
>> >
>> > Also, can I stream data into a value, vice passing a byte array? Similar
>> > to
>> > how CLOBs and BLOBs are handled in an RDBMS.
>> >
>> > Thanks,
>> >
>> > Frank

Re: Best practices in sizing values?

Posted by Billie Rinaldi <bi...@gmail.com>.
See also the filedata example, which splits a file into chunks in a way
similar to what Josh describes.
http://accumulo.apache.org/1.5/examples/filedata.html
There is more information about the table structure for this example under
Data Table in the dirlist example.
http://accumulo.apache.org/1.5/examples/dirlist.html


On Sun, Jun 9, 2013 at 6:33 PM, Josh Elser <jo...@gmail.com> wrote:

> You would likely want to keep some common prefix in the key. This would
> make seeking to an arbitrary point in the file easier.
>
> e.g.
>
> doc1 data:0000001 [] _bytes_
> doc1 data:0000002 [] _bytes_
> doc1 data:0000003 [] _bytes_
>
> As far as chunk size, Christopher's advice is probably better than
> anything I could provide without direct experimentation with the HDFS block
> size, Accumulo table.file.compress.blocksize, and size of each Value. The
> best choice for you likely depends on your usage patterns.
>
> You could even store additional metadata for each "document" you store,
> such as chunk size, number of chunks, etc. Lots of flexibility with how you
> could approach this given the flexibility Accumulo provides with the
> columns you can use.
>
>
> On 06/09/2013 08:56 PM, Frank Smith wrote:
>
>> Josh,
>>
>> That is an interesting idea.  Would you link them through the keys, or
>> append the key to the end of the value of the previous part?
>>
>> You have thoughts on how big the chunks should be?
>>
>> I definitely agree that it would be better to keep the data in Accumulo,
>> vice references to the HDFS.  Accumulo already gives me a scheme for
>> organizing files very effectively on the HDFS, rolling my own doesn't
>> make sense, unless I don't have a good sense for the limitations of a
>> tablet server to manage those large files.
>>
>> Thanks,
>>
>> Frank
>>
>>  > Date: Sun, 9 Jun 2013 20:45:15 -0400
>>  > From: josh.elser@gmail.com
>>  > To: user@accumulo.apache.org
>>  > Subject: Re: Best practices in sizing values?
>>  >
>>  > One thing I wanted to add is that you will likely fare quite well
>>  > storing your very large files as a linked-list of bytes (multiple
>>  > key-value pairs make up one of your large blobs of text). You can even
>>  > use your segmentation of the large chunks of text to do more efficient
>>  > seek'ing within the file, if applicable to your application.
>>  >
>>  > I personally don't like the idea of using storing HDFS URIs into
>>  > Accumulo. If you think about what Accumulo is providing you, one of the
>>  > things it's great at is abstracting away the notion of that underlying
>>  > filesystem. Just a thought.
>>  >
>>  > On 06/09/2013 08:21 PM, Frank Smith wrote:
>>  > > So, what are your thoughts on storing a bunch of small files on the
>>  > > HDFS? Sequence Files, Avro?
>>  > >
>>  > > I will note that these are essentially write once and read heavy
>> chunks
>>  > > of text.
>>  > >
>>  > > > Date: Sun, 9 Jun 2013 17:08:42 -0400
>>  > > > Subject: Re: Best practices in sizing values?
>>  > > > From: ctubbsii@apache.org
>>  > > > To: user@accumulo.apache.org
>>  > > >
>>  > > > At the very least, I would keep it under the size of your
>> compressed
>>  > > > data blocks in your RFiles (this may mean you should increase
>> value of
>>  > > > table.file.compress.blocksize to be larger than the default of
>> 100K).
>>  > > >
>>  > > > You could also tweak this according to your application. Say, for
>>  > > > example, you wanted to limit the additional work to resolve the
>>  > > > pointer and retrieve from HDFS only 5% of the time, you could
>> sample
>>  > > > your data, and choose a cutoff value that keeps 95% of your data in
>>  > > > the Accumulo table.
>>  > > >
>>  > > > Personally, I like to keep things under 1MB in the value, and
>> under 1K
>>  > > > in the key, as a crude rule of thumb, but it very much depends on
>> the
>>  > > > application.
>>  > > >
>>  > > > --
>>  > > > Christopher L Tubbs II
>>  > > > http://gravatar.com/ctubbsii
>>  > > >
>>  > > >
>>  > > > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
>>  > > <fr...@outlook.com> wrote:
>>  > > > > I have an application where I have a block of unstructured text.
>>  > > Normally
>>  > > > > that text is relatively small <500k, but there are conditions
>> where
>>  > > it can
>>  > > > > be up to GBs of text.
>>  > > > >
>>  > > > > I was considering of using a threshold where I simply decide to
>>  > > change from
>>  > > > > storing the text in the value of my mutation, and just add a
>>  > > reference to
>>  > > > > the HDFS location, but I wanted to get some advice on where that
>>  > > threshold
>>  > > > > should (best practice) or must (system limitation) be?
>>  > > > >
>>  > > > > Also, can I stream data into a value, vice passing a byte array?
>>  > > Similar to
>>  > > > > how CLOBs and BLOBs are handled in an RDBMS.
>>  > > > >
>>  > > > > Thanks,
>>  > > > >
>>  > > > > Frank
>>
>

Re: Best practices in sizing values?

Posted by Josh Elser <jo...@gmail.com>.
You would likely want to keep some common prefix in the key. This would 
make seeking to an arbitrary point in the file easier.

e.g.

doc1 data:0000001 [] _bytes_
doc1 data:0000002 [] _bytes_
doc1 data:0000003 [] _bytes_

As far as chunk size, Christopher's advice is probably better than 
anything I could provide without direct experimentation with the HDFS 
block size, Accumulo table.file.compress.blocksize, and size of each 
Value. The best choice for you likely depends on your usage patterns.

You could even store additional metadata for each "document" you store, 
such as chunk size, number of chunks, etc. There is a lot of flexibility in 
how you could approach this, given the freedom Accumulo gives you with the 
columns you can use. (A rough sketch of such a metadata entry follows.)
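
As a rough illustration of that metadata idea (the "docs" table and the
"meta" column family are just example names, not something Accumulo
defines), you could pair the data chunks with a small summary entry under
the same row:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class DocMetadata {
  // After writing the "data" chunks for a document, record how it was
  // split so a reader knows what to expect before scanning.
  public static void write(Connector conn, String docId, int numChunks,
      int chunkSize, long totalBytes) throws Exception {
    BatchWriter bw = conn.createBatchWriter("docs", new BatchWriterConfig());
    Mutation m = new Mutation(docId);
    m.put("meta", "numChunks", new Value(Integer.toString(numChunks).getBytes()));
    m.put("meta", "chunkSize", new Value(Integer.toString(chunkSize).getBytes()));
    m.put("meta", "totalBytes", new Value(Long.toString(totalBytes).getBytes()));
    bw.addMutation(m);
    bw.close();
  }
}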

On 06/09/2013 08:56 PM, Frank Smith wrote:
> Josh,
>
> That is an interesting idea.  Would you link them through the keys, or
> append the key to the end of the value of the previous part?
>
> You have thoughts on how big the chunks should be?
>
> I definitely agree that it would be better to keep the data in Accumulo,
> vice references to the HDFS.  Accumulo already gives me a scheme for
> organizing files very effectively on the HDFS, rolling my own doesn't
> make sense, unless I don't have a good sense for the limitations of a
> tablet server to manage those large files.
>
> Thanks,
>
> Frank
>
>  > Date: Sun, 9 Jun 2013 20:45:15 -0400
>  > From: josh.elser@gmail.com
>  > To: user@accumulo.apache.org
>  > Subject: Re: Best practices in sizing values?
>  >
>  > One thing I wanted to add is that you will likely fare quite well
>  > storing your very large files as a linked-list of bytes (multiple
>  > key-value pairs make up one of your large blobs of text). You can even
>  > use your segmentation of the large chunks of text to do more efficient
>  > seek'ing within the file, if applicable to your application.
>  >
>  > I personally don't like the idea of using storing HDFS URIs into
>  > Accumulo. If you think about what Accumulo is providing you, one of the
>  > things it's great at is abstracting away the notion of that underlying
>  > filesystem. Just a thought.
>  >
>  > On 06/09/2013 08:21 PM, Frank Smith wrote:
>  > > So, what are your thoughts on storing a bunch of small files on the
>  > > HDFS? Sequence Files, Avro?
>  > >
>  > > I will note that these are essentially write once and read heavy chunks
>  > > of text.
>  > >
>  > > > Date: Sun, 9 Jun 2013 17:08:42 -0400
>  > > > Subject: Re: Best practices in sizing values?
>  > > > From: ctubbsii@apache.org
>  > > > To: user@accumulo.apache.org
>  > > >
>  > > > At the very least, I would keep it under the size of your compressed
>  > > > data blocks in your RFiles (this may mean you should increase
> value of
>  > > > table.file.compress.blocksize to be larger than the default of 100K).
>  > > >
>  > > > You could also tweak this according to your application. Say, for
>  > > > example, you wanted to limit the additional work to resolve the
>  > > > pointer and retrieve from HDFS only 5% of the time, you could sample
>  > > > your data, and choose a cutoff value that keeps 95% of your data in
>  > > > the Accumulo table.
>  > > >
>  > > > Personally, I like to keep things under 1MB in the value, and
> under 1K
>  > > > in the key, as a crude rule of thumb, but it very much depends on the
>  > > > application.
>  > > >
>  > > > --
>  > > > Christopher L Tubbs II
>  > > > http://gravatar.com/ctubbsii
>  > > >
>  > > >
>  > > > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
>  > > <fr...@outlook.com> wrote:
>  > > > > I have an application where I have a block of unstructured text.
>  > > Normally
>  > > > > that text is relatively small <500k, but there are conditions where
>  > > it can
>  > > > > be up to GBs of text.
>  > > > >
>  > > > > I was considering of using a threshold where I simply decide to
>  > > change from
>  > > > > storing the text in the value of my mutation, and just add a
>  > > reference to
>  > > > > the HDFS location, but I wanted to get some advice on where that
>  > > threshold
>  > > > > should (best practice) or must (system limitation) be?
>  > > > >
>  > > > > Also, can I stream data into a value, vice passing a byte array?
>  > > Similar to
>  > > > > how CLOBs and BLOBs are handled in an RDBMS.
>  > > > >
>  > > > > Thanks,
>  > > > >
>  > > > > Frank

RE: Best practices in sizing values?

Posted by Frank Smith <fr...@outlook.com>.
Josh,
That is an interesting idea.  Would you link them through the keys, or append the key to the end of the value of the previous part?
Do you have thoughts on how big the chunks should be?
I definitely agree that it would be better to keep the data in Accumulo rather than references to HDFS.  Accumulo already gives me a scheme for organizing files very effectively on HDFS, so rolling my own doesn't make sense, unless I don't have a good sense of a tablet server's limitations in managing those large files.
Thanks,
Frank

> Date: Sun, 9 Jun 2013 20:45:15 -0400
> From: josh.elser@gmail.com
> To: user@accumulo.apache.org
> Subject: Re: Best practices in sizing values?
> 
> One thing I wanted to add is that you will likely fare quite well 
> storing your very large files as a linked-list of bytes (multiple 
> key-value pairs make up one of your large blobs of text). You can even 
> use your segmentation of the large chunks of text to do more efficient 
> seek'ing within the file, if applicable to your application.
> 
> I personally don't like the idea of using storing HDFS URIs into 
> Accumulo. If you think about what Accumulo is providing you, one of the 
> things it's great at is abstracting away the notion of that underlying 
> filesystem. Just a thought.
> 
> On 06/09/2013 08:21 PM, Frank Smith wrote:
> > So, what are your thoughts on storing a bunch of small files on the
> > HDFS?  Sequence Files, Avro?
> >
> > I will note that these are essentially write once and read heavy chunks
> > of text.
> >
> >  > Date: Sun, 9 Jun 2013 17:08:42 -0400
> >  > Subject: Re: Best practices in sizing values?
> >  > From: ctubbsii@apache.org
> >  > To: user@accumulo.apache.org
> >  >
> >  > At the very least, I would keep it under the size of your compressed
> >  > data blocks in your RFiles (this may mean you should increase value of
> >  > table.file.compress.blocksize to be larger than the default of 100K).
> >  >
> >  > You could also tweak this according to your application. Say, for
> >  > example, you wanted to limit the additional work to resolve the
> >  > pointer and retrieve from HDFS only 5% of the time, you could sample
> >  > your data, and choose a cutoff value that keeps 95% of your data in
> >  > the Accumulo table.
> >  >
> >  > Personally, I like to keep things under 1MB in the value, and under 1K
> >  > in the key, as a crude rule of thumb, but it very much depends on the
> >  > application.
> >  >
> >  > --
> >  > Christopher L Tubbs II
> >  > http://gravatar.com/ctubbsii
> >  >
> >  >
> >  > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
> > <fr...@outlook.com> wrote:
> >  > > I have an application where I have a block of unstructured text.
> > Normally
> >  > > that text is relatively small <500k, but there are conditions where
> > it can
> >  > > be up to GBs of text.
> >  > >
> >  > > I was considering of using a threshold where I simply decide to
> > change from
> >  > > storing the text in the value of my mutation, and just add a
> > reference to
> >  > > the HDFS location, but I wanted to get some advice on where that
> > threshold
> >  > > should (best practice) or must (system limitation) be?
> >  > >
> >  > > Also, can I stream data into a value, vice passing a byte array?
> > Similar to
> >  > > how CLOBs and BLOBs are handled in an RDBMS.
> >  > >
> >  > > Thanks,
> >  > >
> >  > > Frank

Re: Best practices in sizing values?

Posted by Josh Elser <jo...@gmail.com>.
One thing I wanted to add is that you will likely fare quite well 
storing your very large files as a linked-list of bytes (multiple 
key-value pairs make up one of your large blobs of text). You can even 
use your segmentation of the large chunks of text to do more efficient 
seeking within the file, if applicable to your application.

I personally don't like the idea of storing HDFS URIs in 
Accumulo. If you think about what Accumulo is providing you, one of the 
things it's great at is abstracting away the notion of that underlying 
filesystem. Just a thought.
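
To make the linked-list idea above concrete, here is a rough sketch of
reassembling one document by scanning its row (again, the "docs" table,
the "data" family, and the row id are placeholders). Because the
zero-padded chunk qualifiers sort lexicographically, the pieces come back
in order:

import java.io.ByteArrayOutputStream;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class ChunkedReader {
  // Concatenate all "data" chunks stored under a single row.
  public static byte[] read(Connector conn, String docId) throws Exception {
    Scanner scan = conn.createScanner("docs", Authorizations.EMPTY);
    scan.setRange(new Range(docId));           // just this document's row
    scan.fetchColumnFamily(new Text("data"));  // skip any metadata columns
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Map.Entry<Key, Value> e : scan) {
      out.write(e.getValue().get());
    }
    return out.toByteArray();
  }
}

For the multi-GB case you would process each chunk as it arrives instead
of buffering the whole document, which is exactly what the chunked layout
lets you do.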

On 06/09/2013 08:21 PM, Frank Smith wrote:
> So, what are your thoughts on storing a bunch of small files on the
> HDFS?  Sequence Files, Avro?
>
> I will note that these are essentially write once and read heavy chunks
> of text.
>
>  > Date: Sun, 9 Jun 2013 17:08:42 -0400
>  > Subject: Re: Best practices in sizing values?
>  > From: ctubbsii@apache.org
>  > To: user@accumulo.apache.org
>  >
>  > At the very least, I would keep it under the size of your compressed
>  > data blocks in your RFiles (this may mean you should increase value of
>  > table.file.compress.blocksize to be larger than the default of 100K).
>  >
>  > You could also tweak this according to your application. Say, for
>  > example, you wanted to limit the additional work to resolve the
>  > pointer and retrieve from HDFS only 5% of the time, you could sample
>  > your data, and choose a cutoff value that keeps 95% of your data in
>  > the Accumulo table.
>  >
>  > Personally, I like to keep things under 1MB in the value, and under 1K
>  > in the key, as a crude rule of thumb, but it very much depends on the
>  > application.
>  >
>  > --
>  > Christopher L Tubbs II
>  > http://gravatar.com/ctubbsii
>  >
>  >
>  > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
> <fr...@outlook.com> wrote:
>  > > I have an application where I have a block of unstructured text.
> Normally
>  > > that text is relatively small <500k, but there are conditions where
> it can
>  > > be up to GBs of text.
>  > >
>  > > I was considering of using a threshold where I simply decide to
> change from
>  > > storing the text in the value of my mutation, and just add a
> reference to
>  > > the HDFS location, but I wanted to get some advice on where that
> threshold
>  > > should (best practice) or must (system limitation) be?
>  > >
>  > > Also, can I stream data into a value, vice passing a byte array?
> Similar to
>  > > how CLOBs and BLOBs are handled in an RDBMS.
>  > >
>  > > Thanks,
>  > >
>  > > Frank

RE: Best practices in sizing values?

Posted by Frank Smith <fr...@outlook.com>.
So, what are your thoughts on storing a bunch of small files on the HDFS?  Sequence Files, Avro?
I will note that these are essentially write-once and read-heavy chunks of text.

> Date: Sun, 9 Jun 2013 17:08:42 -0400
> Subject: Re: Best practices in sizing values?
> From: ctubbsii@apache.org
> To: user@accumulo.apache.org
> 
> At the very least, I would keep it under the size of your compressed
> data blocks in your RFiles (this may mean you should increase value of
> table.file.compress.blocksize to be larger than the default of 100K).
> 
> You could also tweak this according to your application. Say, for
> example, you wanted to limit the additional work to resolve the
> pointer and retrieve from HDFS only 5% of the time, you could sample
> your data, and choose a cutoff value that keeps 95% of your data in
> the Accumulo table.
> 
> Personally, I like to keep things under 1MB in the value, and under 1K
> in the key, as a crude rule of thumb, but it very much depends on the
> application.
> 
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
> 
> 
> On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith <fr...@outlook.com> wrote:
> > I have an application where I have a block of unstructured text.  Normally
> > that text is relatively small <500k, but there are conditions where it can
> > be up to GBs of text.
> >
> > I was considering of using a threshold where I simply decide to change from
> > storing the text in the value of my mutation, and just add a reference to
> > the HDFS location, but I wanted to get some advice on where that threshold
> > should (best practice) or must (system limitation) be?
> >
> > Also, can I stream data into a value, vice passing a byte array?  Similar to
> > how CLOBs and BLOBs are handled in an RDBMS.
> >
> > Thanks,
> >
> > Frank

Re: Best practices in sizing values?

Posted by Christopher <ct...@apache.org>.
At the very least, I would keep it under the size of the compressed
data blocks in your RFiles (this may mean you should increase the value
of table.file.compress.blocksize to be larger than the default of 100K).
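
For what it's worth, a per-table property like that can be bumped through
the Java API; the table name and the 1M value below are just examples (the
shell equivalent would be: config -t mytable -s
table.file.compress.blocksize=1M):

import org.apache.accumulo.core.client.Connector;

public class RaiseBlockSize {
  // Raise the RFile compression block size for one table so that typical
  // values fit within a single compressed block.
  public static void apply(Connector conn) throws Exception {
    conn.tableOperations().setProperty("mytable",
        "table.file.compress.blocksize", "1M");
  }
}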

You could also tweak this according to your application. Say, for
example, that you want to limit the extra work of resolving the pointer
and retrieving from HDFS to only 5% of reads; you could sample your
data and choose a cutoff value that keeps 95% of your data in
the Accumulo table.
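
If it helps, picking that cutoff from a sample of document sizes is little
more than a sort; a rough sketch (the 95% keep fraction is just the
example above):

import java.util.Collections;
import java.util.List;

public class CutoffPicker {
  // Given sampled document sizes in bytes, return a size threshold that
  // keeps roughly the requested fraction of documents inline in Accumulo.
  public static long cutoff(List<Long> sampledSizes, double keepFraction) {
    Collections.sort(sampledSizes);
    int idx = (int) Math.ceil(keepFraction * sampledSizes.size()) - 1;
    return sampledSizes.get(Math.max(idx, 0));
  }
}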

Personally, I like to keep things under 1MB in the value, and under 1K
in the key, as a crude rule of thumb, but it very much depends on the
application.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith <fr...@outlook.com> wrote:
> I have an application where I have a block of unstructured text.  Normally
> that text is relatively small <500k, but there are conditions where it can
> be up to GBs of text.
>
> I was considering of using a threshold where I simply decide to change from
> storing the text in the value of my mutation, and just add a reference to
> the HDFS location, but I wanted to get some advice on where that threshold
> should (best practice) or must (system limitation) be?
>
> Also, can I stream data into a value, vice passing a byte array?  Similar to
> how CLOBs and BLOBs are handled in an RDBMS.
>
> Thanks,
>
> Frank