Posted to user@accumulo.apache.org by "Seidl, Ed" <se...@llnl.gov> on 2014/05/28 19:49:47 UTC

stupid/dangerous batch load question

I have a large amount of data that I am batch loading into Accumulo.  I'm using MapReduce to read in chunks of data and write out RFiles to be loaded with importdirectory.  I've noticed that the import will hang for longer and longer as more data is added.  For instance, one table, which currently has ~2500 tablets, now takes around 2 hours to process the importdirectory.

In poking around in the source for TableOperationsImpl (1.5.0), I see that there is an option to not wait on certain operations (like compact).  Would it be dangerous to (optionally) return immediately from importdirectory, and instead check the failure directory to detect errors in the import?  I know this will eventually cause a backlog in the staging directories, but is there any potential to corrupt the tables?
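
For reference, the load step is essentially the following (a sketch rather than my exact code; the instance name, user, table, and paths are placeholders):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myinstance", "zk1:2181")
        .getConnector("ingest", new PasswordToken("secret"));

    // Blocks until every RFile under the source directory has been
    // assigned to tablets; files that cannot be loaded are moved into
    // the failure directory.
    conn.tableOperations().importDirectory(
        "mytable",               // target table
        "/bulk/job42/files",     // RFiles written by the MR job
        "/bulk/job42/failures",  // must exist and be empty beforehand
        false);                  // setTime
  }
}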

Thanks,
Ed

Re: stupid/dangerous batch load question

Posted by "Seidl, Ed" <se...@llnl.gov>.
It's a small cluster, only 7 tservers.

While the importdirectory is in progress, no other MR jobs are running.  The number of queued major compactions increases into the thousands.  I have the number of bulk import threads set to 10, so there will be 70 concurrent major compactions shown on the monitor page.  System load is actually pretty low, around 40 or so.  The rows themselves are not very large, but the row IDs are on the order of 150 bytes.  Overridden bits of config below:

default  | general.rpc.timeout ............................. | 120s
site     |    @override .................................... | 300s
default  | master.bulk.threadpool.size ..................... | 5
system   |    @override .................................... | 32
default  | master.bulk.timeout ............................. | 5m
system   |    @override .................................... | 60m
default  | tserver.bulk.assign.threads ..................... | 1
system   |    @override .................................... | 10
default  | tserver.bulk.process.threads .................... | 1
system   |    @override .................................... | 10
default  | tserver.bulk.timeout ............................ | 5m
system   |    @override .................................... | 60m
default  | tserver.cache.data.size ......................... | 128M
site     |    @override .................................... | 256M
default  | tserver.compaction.major.concurrent.max ......... | 3
system   |    @override .................................... | 10
default  | tserver.compaction.minor.concurrent.max ......... | 4
site     |    @override .................................... | 10
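
(The system-scoped overrides above live in ZooKeeper and can be set either with "config -s" in the shell or through the client API; a rough sketch of the latter, with "conn" standing in for an admin Connector:)

import org.apache.accumulo.core.client.Connector;

public class ApplyOverrides {
  // Sketch: instance-wide overrides, equivalent to "config -s prop=value"
  // in the shell.  Values match the system-scoped rows in the listing.
  static void apply(Connector conn) throws Exception {
    conn.instanceOperations().setProperty("master.bulk.threadpool.size", "32");
    conn.instanceOperations().setProperty("master.bulk.timeout", "60m");
    conn.instanceOperations().setProperty("tserver.bulk.assign.threads", "10");
    conn.instanceOperations().setProperty("tserver.bulk.process.threads", "10");
    conn.instanceOperations().setProperty("tserver.bulk.timeout", "60m");
    conn.instanceOperations().setProperty("tserver.compaction.major.concurrent.max", "10");
  }
}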

Thanks,
Ed

From: David Medinets <da...@gmail.com>
Reply-To: "user@accumulo.apache.org" <us...@accumulo.apache.org>
Date: Wednesday, May 28, 2014 11:16 AM
To: accumulo-user <us...@accumulo.apache.org>
Subject: Re: stupid/dangerous batch load question

Lots of questions can be asked:

How many servers?
How many compactions are being run at once?
What is the size of the mutations?

What does the Accumulo monitor page say during the ingest process? Does it indicate high load?

Are you running map-reduce jobs at the same time as the bulk ingest?

I think there is a setting to change the number of threads used by bulk ingest. Can you run 'config -t' and post the results?

I've used tables with thousands of tablets, and I can't remember ever having to wait for a bulk ingest to finish.



On Wed, May 28, 2014 at 1:49 PM, Seidl, Ed <se...@llnl.gov> wrote:
I have a large amount of data that I am batch loading into Accumulo.  I'm using MapReduce to read in chunks of data and write out RFiles to be loaded with importdirectory.  I've noticed that the import will hang for longer and longer as more data is added.  For instance, one table, which currently has ~2500 tablets, now takes around 2 hours to process the importdirectory.

In poking around in the source for TableOperationsImpl (1.5.0), I see that there is an option to not wait on certain operations (like compact).  Would it be dangerous to (optionally) return immediately from importdirectory, and instead check the failure directory to detect errors in the import?  I know this will eventually cause a backlog in the staging directories, but is there any potential to corrupt the tables?

Thanks,
Ed


Re: stupid/dangerous batch load question

Posted by David Medinets <da...@gmail.com>.
Lots of questions can be asked:

How many servers?
How many compactions are being run at once?
What is the size of the mutations?

What does the Accumulo monitor page say during the ingest process? Does it
indicate high load?

Are you running map-reduce jobs at the same time as the bulk ingest?

I think there is a setting to change the number of threads used by bulk
ingest. Can you run 'config -t' and post the results?

I've used tables with thousands of tablets, and I can't remember ever having
to wait for a bulk ingest to finish.



On Wed, May 28, 2014 at 1:49 PM, Seidl, Ed <se...@llnl.gov> wrote:

>  I have a large amount of data that I am batch loading into Accumulo.
>  I'm using MapReduce to read in chunks of data and write out RFiles to be
> loaded with importdirectory.  I've noticed that the import will hang for
> longer and longer as more data is added.  For instance, one table,
> which currently has ~2500 tablets, now takes around 2 hours to process the
> importdirectory.
>
>  In poking around in the source for TableOperationsImpl (1.5.0), I see
> that there is an option to not wait on certain operations (like compact).
>  Would it be dangerous to (optionally) return immediately from
> importdirectory, and instead check the failure directory to detect errors
> in the import?  I know this will eventually cause a backlog in the staging
> directories, but is there any potential to corrupt the tables?
>
>  Thanks,
> Ed
>

Re: stupid/dangerous batch load question

Posted by "Seidl, Ed" <se...@llnl.gov>.
That's the rub.  I have 120 reducers running, so I wind up with 120 RFiles to import.  I haven't tried playing with a custom partitioner that sends adjacent key ranges to the same reducer so the RFiles won't have overlapping keys.  Perhaps that would help?
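
Something like this is what I was picturing (an untested sketch; RangePartitioner ships with Accumulo, and the splits file, output path, and reducer count are placeholders):

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
  // Sketch: align reducer boundaries with tablet boundaries so each
  // reducer writes one RFile covering a contiguous, non-overlapping
  // range.  Map output keys must be Text rows for RangePartitioner.
  static void configure(Job job, String splitsFile, int splitCount) throws Exception {
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, splitsFile); // one base64 row per line
    job.setNumReduceTasks(splitCount + 1);          // one reducer per range
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/bulk/job42/files"));
  }
}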

Thanks,
Ed

From: Mike Drob <ma...@cloudera.com>
Reply-To: "user@accumulo.apache.org" <us...@accumulo.apache.org>
Date: Wednesday, May 28, 2014 11:22 AM
To: "user@accumulo.apache.org" <us...@accumulo.apache.org>
Subject: Re: stupid/dangerous batch load question

Are you partitioning the resultant files by the existing table splits, or just sending everything to one file?

If you are importing multiple files, then there is potential that some of the files succeed and others fail. Depending on how your data is laid out, this may cause application level corruption, but the underlying key/value store should be ok.
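
For example, anything still sitting in the failure directory after the import returns did not load, so you could treat it as a retry queue. A rough sketch with the Hadoop FileSystem API (paths are placeholders, and it assumes the retry directory already exists):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckFailures {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Files the import could not load are left in the failure directory;
    // move them aside so a later importdirectory call can retry them.
    for (FileStatus f : fs.listStatus(new Path("/bulk/job42/failures"))) {
      fs.rename(f.getPath(), new Path("/bulk/retry", f.getPath().getName()));
    }
  }
}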


On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <se...@llnl.gov> wrote:
I have a large amount of data that I am batch loading into Accumulo.  I'm using MapReduce to read in chunks of data and write out RFiles to be loaded with importdirectory.  I've noticed that the import will hang for longer and longer as more data is added.  For instance, one table, which currently has ~2500 tablets, now takes around 2 hours to process the importdirectory.

In poking around in the source for TableOperationsImpl (1.5.0), I see that there is an option to not wait on certain operations (like compact).  Would it be dangerous to (optionally) return immediately from importdirectory, and instead check the failure directory to detect errors in the import?  I know this will eventually cause a backlog in the staging directories, but is there any potential to corrupt the tables?

Thanks,
Ed


Re: stupid/dangerous batch load question

Posted by Josh Elser <jo...@gmail.com>.
On 5/28/14, 2:22 PM, Mike Drob wrote:
> Are you partitioning the resultant files by the existing table splits,
> or just sending everything to one file?

Emphasis on this. Sending a large file to every tablet for a table can 
be very expensive. Trying to align the files you're generating with the 
splits of a table will help alleviate that cost.
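
One way to get those split points is to dump them from the live table. A rough sketch (assumes commons-codec for the base64 encoding that RangePartitioner's split file expects; the output path is up to you):

import java.io.PrintStream;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.util.TextUtil;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class WriteSplits {
  // Sketch: dump the table's split points, one base64-encoded row per
  // line, for use with RangePartitioner.setSplitFile.  Returns the
  // number of splits (use count + 1 reducers).
  static int write(Connector conn, String table, String file) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    PrintStream out = new PrintStream(fs.create(new Path(file)));
    int count = 0;
    for (Text split : conn.tableOperations().listSplits(table)) {
      out.println(new String(Base64.encodeBase64(TextUtil.getBytes(split))));
      count++;
    }
    out.close();
    return count;
  }
}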

> If you are importing multiple files, then there is potential that some
> of the files succeed and others fail. Depending on how your data is laid
> out, this may cause application level corruption, but the underlying
> key/value store should be ok.
>
>
> On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <seidl2@llnl.gov
> <ma...@llnl.gov>> wrote:
>
>     I have a large amount of data that I am batch loading into Accumulo.
>       I'm using MapReduce to read in chunks of data and write out RFiles
>     to be loaded with importdirectory.  I've noticed that the import
>     will hang for longer and longer as more data is added.  For
>     instance, one table, which currently has ~2500 tablets, now takes
>     around 2 hours to process the importdirectory.
>
>     In poking around in the source for TableOperationsImpl (1.5.0), I
>     see that there is an option to not wait on certain operations (like
>     compact).  Would it be dangerous to (optionally) return immediately
>     from importdirectory, and instead check the failure directory to
>     detect errors in the import?  I know this will eventually cause a
>     backlog in the staging directories, but is there any potential to
>     corrupt the tables?
>
>     Thanks,
>     Ed
>
>

Re: stupid/dangerous batch load question

Posted by Mike Drob <ma...@cloudera.com>.
Are you partitioning the resultant files by the existing table splits, or
just sending everything to one file?

If you are importing multiple files, then there is potential that some of
the files succeed and others fail. Depending on how your data is laid out,
this may cause application level corruption, but the underlying key/value
store should be ok.


On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <se...@llnl.gov> wrote:

>  I have a large amount of data that I am batch loading into Accumulo.
>  I'm using MapReduce to read in chunks of data and write out RFiles to be
> loaded with importdirectory.  I've noticed that the import will hang for
> longer and longer as more data is added.  For instance, one table,
> which currently has ~2500 tablets, now takes around 2 hours to process the
> importdirectory.
>
>  In poking around in the source for TableOperationsImpl (1.5.0), I see
> that there is an option to not wait on certain operations (like compact).
>  Would it be dangerous to (optionally) return immediately from
> importdirectory, and instead check the failure directory to detect errors
> in the import?  I know this will eventually cause a backlog in the staging
> directories, but is there any potential to corrupt the tables?
>
>  Thanks,
> Ed
>