Posted to user@accumulo.apache.org by "Seidl, Ed" <se...@llnl.gov> on 2012/07/20 20:59:29 UTC

appending data to tables (partitioning?)

Hi all, I have another question about dealing with large amounts of data…

I'm trying to store large blobs of data inside of Accumulo by means of directory imports.  These blobs are binary and are referenced by other tables.  They can also get quite large.  In an effort to cut down on the amount of time spent doing compactions on this data, I've taken to using what amounts to an increasing sequence number for the rowIDs, so now a major compaction amounts to a copy of the data, but no merging has to happen.  I can also play with the table.split.threshold property for the table to keep tablets from splitting.  But sometimes a compaction will occur, which results in a lot of data being unnecessarily copied from one rfile to another.

So, my question: is there any way to signal to Accumulo that the rfiles I'm trying to do an importdirectory on should just be used as is, with no compaction desired (i.e., just move the rfiles into the table directory rather than moving them to a temp directory for later merging upon compaction)?  The paradigm I'm shooting for here is like Oracle partitioned tables, where you can fill a tmp table with new data and then swap that tmp table with an empty partition on the target table, the whole process taking seconds since no data moves, just pointers in the guts of the DB.
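
For reference, here's a minimal sketch of the directory import in question, using the Java client API of that era.  The instance name, zookeepers, credentials, table name, and paths are all placeholders, not real values:

    // Minimal sketch of a directory (bulk) import with the 1.4-era client API.
    // Instance name, zookeepers, credentials, table name, and paths are placeholders.
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;

    public class BulkImportBlobs {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("user", "secret".getBytes());
        // The source directory must hold sorted rfiles; the failure directory
        // must exist and be empty, and receives any files that can't be loaded.
        conn.tableOperations().importDirectory("blobs", "/tmp/bulk",
            "/tmp/bulk-fail", false);
      }
    }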

If there's no current way to do this, would such a mechanism be desirable to anyone other than me?  I wouldn't mind taking a stab at implementing this, but I don't want to start if it's a feature that no one would want, or one that's thought to be totally stupid in the first place :) (As an aside, yes, I've thought of storing the data in HDFS and keeping a pointer to it in Accumulo, but interacting with the data the way I want is much easier if it's all in Accumulo tables.)

Thanks,
Ed

Re: appending data to tables (partitioning?)

Posted by "Seidl, Ed" <se...@llnl.gov>.

On 7/20/12 1:23 PM, "Billie J Rinaldi" <bi...@ugov.gov> wrote:

>
>One thing you should think about is making it so that you only have one
>file per tablet, i.e. that you create a new split point for every new
>file that you import.  This should be doable if your files are pretty
>large and you don't end up having too many tablets.  If there is only one
>file per tablet, it won't compact unless you tell it to.

Awesome...that's exactly the case...I'll have one file per tablet, and all
the files should be more-or-less the same size (within 10% or so), on the
order of a gigabyte each.  Thanks for the split point tip...I hadn't
thought of that.  This should do exactly what I want.

Thanks!
Ed

>
>If you want to have multiple files per tablet, there are a number of
>parameters you should think about.  However, you should make sure that
>you don't have too many files per tablet because 1) query performance
>will suffer and 2) there is a limit to the number of files that a tablet
>server will open.  The limit to open files is adjustable.  For scan, it
>defaults to 100 files for all the tablets, and for major compaction it
>defaults to 10 files per tablet (but the compaction can be performed in
>stages).
>
>To change the compaction criteria, adjust table.file.max and
>table.compaction.major.ratio.  table.file.max is the maximum number of
>files that a tablet can have.  If a tablet has more files than this, it
>will compact.  table.compaction.major.ratio governs when compaction
>occurs when a tablet has fewer files than the maximum.  It also governs
>which files are compacted together in either case.  Raising the ratio
>will make compactions happen less often.  If table.file.max is larger than the
>number of files you expect to have per tablet, setting
>table.compaction.major.ratio to the same value as table.file.max should
>keep it from compacting unless there is high variation in your file
>sizes.  A set of files is compacted into a single file if the size of the
>largest file times the ratio is <= the sum of the sizes of the files.
>
>Billie
>


Re: appending data to tables (partitioning?)

Posted by Billie J Rinaldi <bi...@ugov.gov>.
On Friday, July 20, 2012 2:59:29 PM, "Ed Seidl" <se...@llnl.gov> wrote:
> Hi all, I have another question about dealing with large amounts of
> data…
> 
> 
> I'm trying to store large blobs of data inside of Accumulo by means
> of directory imports. These blobs are binary and are referenced by
> other tables. They can also get quite large. In an effort to cut down
> on the amount of time spent doing compactions on this data, I've
> taken to using what amounts to an increasing sequence number for the
> rowIDs, so now a major compaction amounts to a copy of the data, but
> no merging has to happen. I can also play with the
> table.split.threshold property for the table to keep tablets from
> splitting. But sometimes a compaction will occur, which results in a
> lot of data being unnecessarily copied from one rfile to another.
> 
> So, my question: is there any way to signal to Accumulo that the
> rfiles I'm trying to do an importdirectory on should just be used as
> is, with no compaction desired (i.e., just move the rfiles into the
> table directory rather than moving them to a temp directory for
> later merging upon compaction)? The paradigm I'm shooting for here
> is like Oracle partitioned tables, where you can fill a tmp table
> with new data and then swap that tmp table with an empty partition
> on the target table, the whole process taking seconds since no data
> moves, just pointers in the guts of the DB.

One thing you should think about is making it so that you only have one file per tablet, i.e. that you create a new split point for every new file that you import.  This should be doable if your files are pretty large and you don't end up having too many tablets.  If there is only one file per tablet, it won't compact unless you tell it to.
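
A sketch of that approach, with invented zero-padded sequence-number rows for illustration: before importing a file that covers a new range of rowIDs, add a split at the end of the previous range so the incoming file maps onto exactly one new tablet.

    // Sketch: one split point per imported file, so each tablet holds one rfile.
    // The zero-padded sequence-number rows are invented for illustration.
    import java.util.TreeSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.hadoop.io.Text;

    public class OneFilePerTablet {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("user", "secret".getBytes());
        // If the last imported file ended at row 0000041999 and the next file
        // covers 0000042000-0000042999, split at the boundary first.
        TreeSet<Text> splits = new TreeSet<Text>();
        splits.add(new Text("0000041999"));
        conn.tableOperations().addSplits("blobs", splits);
        // Then bulk-import the new file; all of its keys fall in the new
        // (empty) last tablet, which therefore ends up with exactly one file.
        conn.tableOperations().importDirectory("blobs", "/tmp/bulk",
            "/tmp/bulk-fail", false);
      }
    }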

If you want to have multiple files per tablet, there are a number of parameters you should think about.  However, you should make sure that you don't have too many files per tablet because 1) query performance will suffer and 2) there is a limit to the number of files that a tablet server will open.  The limit to open files is adjustable.  For scan, it defaults to 100 files for all the tablets, and for major compaction it defaults to 10 files per tablet (but the compaction can be performed in stages).
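
If you do need to raise those limits, it can be done through the instance operations API.  A sketch follows; the property names (tserver.scan.files.open.max and tserver.compaction.major.thread.files.open.max) are the ones I believe govern these limits, so verify them against your version's documentation before relying on them.

    // Sketch: raising the tablet server open-file limits described above.
    // Property names should be checked against your Accumulo version's docs.
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;

    public class RaiseOpenFileLimits {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("user", "secret".getBytes());
        // Cap on rfiles a tablet server holds open across all tablets for scans
        conn.instanceOperations().setProperty("tserver.scan.files.open.max", "200");
        // Cap on files opened per tablet in one major compaction pass
        conn.instanceOperations().setProperty(
            "tserver.compaction.major.thread.files.open.max", "20");
      }
    }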

To change the compaction criteria, adjust table.file.max and table.compaction.major.ratio.  table.file.max is the maximum number of files that a tablet can have.  If a tablet has more files than this, it will compact.  table.compaction.major.ratio governs when compaction occurs when a tablet has fewer files than the maximum.  It also governs which files are compacted together in either case.  Raising the ratio will make compactions happen less often.  If table.file.max is larger than the number of files you expect to have per tablet, setting table.compaction.major.ratio to the same value as table.file.max should keep it from compacting unless there is high variation in your file sizes.  A set of files is compacted into a single file if the size of the largest file times the ratio is <= the sum of the sizes of the files.
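
To make that last rule concrete, here is a small self-contained sketch of the selection test as stated above (the two table properties themselves can be set the same way as in the earlier sketch, via tableOperations().setProperty):

    // Sketch of the rule above: a candidate set of files compacts into one
    // file when (largest file size) * ratio <= (sum of the file sizes).
    // Illustration only, not Accumulo's actual selection code.
    public class RatioCheck {
      static boolean wouldCompact(long[] fileSizes, double ratio) {
        long largest = 0, sum = 0;
        for (long size : fileSizes) {
          largest = Math.max(largest, size);
          sum += size;
        }
        return largest * ratio <= sum;
      }

      public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // Four similar 1 GB files, default ratio 3: 3 GB <= 4 GB, so they compact.
        System.out.println(wouldCompact(new long[] {gb, gb, gb, gb}, 3.0));
        // Same files with the ratio raised to 15 (matched to table.file.max = 15):
        // 15 GB > 4 GB, so no compaction.
        System.out.println(wouldCompact(new long[] {gb, gb, gb, gb}, 15.0));
      }
    }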

Billie


> If there's no current way to do this, would such a mechanism be
> desirable to anyone other than me? I wouldn't mind taking a stab at
> implementing this, but I don't want to start if it's a feature that
> no one would want, or one that's thought to be totally stupid in the
> first place :) (As an aside, yes, I've thought of storing the data
> in HDFS and keeping a pointer to it in Accumulo, but interacting
> with the data the way I want is much easier if it's all in Accumulo
> tables.)
> 
> 
> Thanks,
> Ed