Posted to user@couchdb.apache.org by Rhett Garber <rh...@gmail.com> on 2009/02/04 08:48:24 UTC

data loading

I'm playing with couchdb for the first time. I'm trying to load up
some data to play with but it seems REALLY slow.

I'm using python-couchdb as the client.

Are there any tips for getting data into couchdb in bulk ?
It seems entirely IO bound at the moment, but it isn't clear to me why.

I'm loading a log file of about 1.2 gigs... processing it in pure
python is CPU bound and takes about 1.5 minutes to get through the
whole thing.

Loading in the couchdb, i've only got 30 megs in the last hour. That
30 megs has turned into 389 megs in the couchdb data file. That
doesn't seem like enough disk IO to cause this sort of delay.....
where is the time going ? network ?

Rhett

Re: Re: data loading

Posted by Dean Landolt <de...@deanlandolt.com>.
On Wed, Feb 4, 2009 at 12:46 PM, Rhett Garber <rh...@gmail.com> wrote:

> On Wed, Feb 4, 2009 at 6:09 AM, Paul Davis <pa...@gmail.com>
> wrote:
>
> > Third, if you have a good method for generating sorted document id's,
> > inserting sorted ID's into CouchDB *should* give you better write
> > performance. Chris Anderson had some luck with this from directly
> > within the Erlang VM. There's no reason it shouldn't apply to the HTTP
> > api as well but I haven't personally tested it just to make sure.
>
> Is it advisable to create my own doc ids ? Why wouldn't they be sorted
> when couchdb creates them ?
> Or perhaps I should be asking, how are the doc ids in couchdb generated ?
>
> Rhett
>

UUIDv4, which is totally random. You could always create your doc ids based
on UUIDv1, which is derived from the system clock and MAC address.
Problematically, its string form doesn't sort in time order (the low-order
time fields come first, roughly ss:mm:hh rather than hh:mm:ss), so you could
instead build your own ids from the system time, highest-order units first
and down to the finest resolution you have, plus whatever else you need, for
the best effect.
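
As an illustration only (the id format and random suffix below are arbitrary
choices, not anything CouchDB prescribes), a sketch of a roughly time-ordered
id generator in Python:

    import time
    import uuid

    def sorted_doc_id():
        # Zero-padded hex microsecond timestamp first, so ids sort in
        # roughly insertion order; a short random suffix guards against
        # collisions within the same microsecond.
        ts = int(time.time() * 1000000)
        return "%016x-%s" % (ts, uuid.uuid4().hex[:8])

    doc = {"_id": sorted_doc_id(), "type": "log_line"}  # field names made up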

Re: Re: data loading

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Feb 4, 2009 at 9:46 AM, Rhett Garber <rh...@gmail.com> wrote:
> Is it advisable to create my own doc ids ? Why wouldn't they be sorted
> when couchdb creates them ?

This is a planned change. Roughly ascending docids play nicer with
the FS cache, so you get much greater (6x in my case) insert speeds.

On my MacBook I was able to get 6,000 docs / sec (in batches of 1,000
docs). I was bypassing the JSON processing so that helped some, but
the ~6x speedup should apply even with the HTTP api, through good
choice of docids.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Re: data loading

Posted by Rhett Garber <rh...@gmail.com>.
On Wed, Feb 4, 2009 at 6:09 AM, Paul Davis <pa...@gmail.com> wrote:

> Third, if you have a good method for generating sorted document id's,
> inserting sorted ID's into CouchDB *should* give you better write
> performance. Chris Anderson had some luck with this from directly
> within the Erlang VM. There's no reason it shouldn't apply to the HTTP
> api as well but I haven't personally tested it just to make sure.

Is it advisable to create my own doc ids ? Why wouldn't they be sorted
when couchdb creates them ?
Or perhaps I should be asking, how are the doc ids in couchdb generated ?

Rhett

Re: Re: data loading

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Feb 4, 2009 at 3:51 AM,  <rh...@gmail.com> wrote:
> So i've got it running at about 30 megs a minute now, which I think is
> going to work fine.
> Should take about an hour per day of data.
>
> The python process and couchdb process seem to be using about 100% of a
> single CPU.
>
> In terms of getting as much data in as fast as I can, how should I go about
> parallelizing this process ?
> How well does couchdb (and erlang I suppose) make use of multiple CPUs in
> linux ?
>
> Is it better to:
> 1. Run multiple importers against the same db
> 2. Run multiple importers against different db's and merge (replicate)
> together on the same box
> 3. Run multiple importers on different db's on different machines and
> replicate them together ?
>

First off, check the version of Erlang you're using. If you happened
to install with `sudo apt-get install erlang`, chances are you got
5.5.5, which is dog slow due to a VM bug.

Second, the quickest method to get data into CouchDB is via
_bulk_docs. You're basically going to trade RAM for speed at this
point: the bigger you can make each batch, the better across the
board. I've done single updates with 1M (smallish) docs before.

Third, if you have a good method for generating sorted document id's,
inserting sorted ID's into CouchDB *should* give you better write
performance. Chris Anderson had some luck with this from directly
within the Erlang VM. There's no reason it shouldn't apply to the HTTP
api as well but I haven't personally tested it just to make sure.

HTH,
Paul Davis

> I'm going to experiment with some of these setups (if they're even possible,
> i'm total newb here) but any
> insight from the experienced would be great.
>
> Thanks,
>
> Rhett
>
> On Feb 4, 2009 12:13am, Rhett Garber <rh...@gmail.com> wrote:
>> Oh awesome. That's much better. Getting about 15 megs a minute now.
>>
>> Rhett
>>
>> On Wed, Feb 4, 2009 at 12:07 AM, Ulises <ulises.cervino@gmail.com> wrote:
>> >> Loading in the couchdb, i've only got 30 megs in the last hour. That
>> >> 30 megs has turned into 389 megs in the couchdb data file. That
>> >> doesn't seem like enough disk IO to cause this sort of delay.....
>> >> where is the time going ? network ?
>> >
>> > Are you uploading one document at a time or using bulk updates? You do
>> > this using update([doc1, doc2,...]) in couchdb-python.
>> >
>> > HTH,
>> >
>> > U
>> >
>
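
For concreteness, a minimal sketch of the _bulk_docs approach Paul describes,
using only the standard library; the database URL, name, and document
contents are placeholders, and the batch size is the RAM-for-speed knob:

    import json
    import urllib.request

    def bulk_insert(docs, db_url="http://127.0.0.1:5984/logs"):
        # POST a whole batch to CouchDB's _bulk_docs endpoint in one
        # round trip. The URL and database name here are placeholders.
        body = json.dumps({"docs": docs}).encode("utf-8")
        req = urllib.request.Request(
            db_url + "/_bulk_docs",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

Combining this with roughly sorted ids, as suggested elsewhere in the thread,
should stack both effects.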

Re: Re: data loading

Posted by Ulises <ul...@gmail.com>.
> 2. Run multiple importers against different db's and merge (replicate)
> together on the same box
> 3. Run multiple importers on different db's on different machines and
> replicate them together ?

AFAIK these two approaches might be your best bet, as CouchDB has a
process per DB, so multiple uploaders to the same DB won't make much
difference (please someone correct me).

Do try them and tell us how it went :)

U

Re: data loading

Posted by Jan Lehnardt <ja...@apache.org>.
On 4 Feb 2009, at 09:51, rhettg@gmail.com wrote:

> So i've got it running at about 30 megs a minute now, which I think
> is going to work fine.
> Should take about an hour per day of data.
>
> The python process and couchdb process seem to be using about 100%
> of a single CPU.

That could be the JSON conversion.

> In terms of getting as much data in as fast as I can, how should I
> go about parallelizing this process ?
> How well does couchdb (and erlang I suppose) make use of multiple
> CPUs in linux ?
>
> Is it better to:
> 1. Run multiple importers against the same db
> 2. Run multiple importers against different db's and merge
> (replicate) together on the same box
> 3. Run multiple importers on different db's on different machines
> and replicate them together ?

All depends on your data and hardware. All writes to a single db get
serialized. If you have a single writer that can fill all the bandwidth
of your single disk, that's all you need, but usually it can't, and
adding more writers can help.

Splitting writes over multiple databases only helps if you can generate
more writes than a single disk can handle and you have multiple disks.
Replication uses bulk inserts, so the final migration step is a
bottleneck again. If you need to sustain a higher write rate, you need
to keep your data in multiple databases and merge on read.

For simple data import, try 2-N writers into the same DB. Everything
else is way too complicated :)

Cheers
Jan
--


>
> I'm going to experiment with some of these setups (if they're even
> possible, i'm total newb here) but any
> insight from the experienced would be great.
>
> Thanks,
>
> Rhett
>
> On Feb 4, 2009 12:13am, Rhett Garber <rh...@gmail.com> wrote:
>> Oh awesome. That's much better. Getting about 15 megs a minute now.
>>
>> Rhett
>>
>> On Wed, Feb 4, 2009 at 12:07 AM, Ulises <ulises.cervino@gmail.com> wrote:
>> >> Loading in the couchdb, i've only got 30 megs in the last hour. That
>> >> 30 megs has turned into 389 megs in the couchdb data file. That
>> >> doesn't seem like enough disk IO to cause this sort of delay.....
>> >> where is the time going ? network ?
>> >
>> > Are you uploading one document at a time or using bulk updates? You do
>> > this using update([doc1, doc2,...]) in couchdb-python.
>> >
>> > HTH,
>> >
>> > U
>> >
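
To make Jan's "2-N writers into the same DB" suggestion concrete, a rough
sketch using python-couchdb and multiprocessing; the chunk file names,
database name, and document shape below are invented for illustration:

    from multiprocessing import Process

    import couchdb  # python-couchdb, the client already in use in this thread


    def import_chunk(chunk_path, db_name="logs", batch_size=1000):
        # Each writer process opens its own connection and pushes
        # batches with update(), which issues bulk inserts.
        db = couchdb.Server("http://127.0.0.1:5984/")[db_name]
        batch = []
        for line in open(chunk_path):
            batch.append({"line": line.rstrip("\n")})
            if len(batch) >= batch_size:
                db.update(batch)
                batch = []
        if batch:
            db.update(batch)


    if __name__ == "__main__":
        # Pre-split the log into one chunk per writer (e.g. with split(1)),
        # then run the writers in parallel against the same database.
        chunks = ["log.part0", "log.part1", "log.part2", "log.part3"]
        writers = [Process(target=import_chunk, args=(c,)) for c in chunks]
        for w in writers:
            w.start()
        for w in writers:
            w.join()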


Re: Re: data loading

Posted by rh...@gmail.com.
So i've got it running at about 30 megs a minute now, which I think is
going to work fine.
Should take about an hour per day of data.

The python process and couchdb process seem to be using about 100% of a
single CPU.

In terms of getting as much data in as fast as I can, how should I go about
parallelizing this process ?
How well does couchdb (and erlang I suppose) make use of multiple CPUs in
linux ?

Is it better to:
1. Run multiple importers against the same db
2. Run multiple importers against different db's and merge (replicate)  
together on the same box
3. Run multiple importers on different db's on different machines and  
replicate them together ?

I'm going to experiment with some of these setups (if they're even  
possible, i'm total newb here) but any
insight from the experienced would be great.

Thanks,

Rhett

On Feb 4, 2009 12:13am, Rhett Garber <rh...@gmail.com> wrote:
> Oh awesome. That's much better. Getting about 15 megs a minute now.
>
> Rhett
>
> On Wed, Feb 4, 2009 at 12:07 AM, Ulises <ulises.cervino@gmail.com> wrote:
> >> Loading in the couchdb, i've only got 30 megs in the last hour. That
> >> 30 megs has turned into 389 megs in the couchdb data file. That
> >> doesn't seem like enough disk IO to cause this sort of delay.....
> >> where is the time going ? network ?
> >
> > Are you uploading one document at a time or using bulk updates? You do
> > this using update([doc1, doc2,...]) in couchdb-python.
> >
> > HTH,
> >
> > U
> >

Re: data loading

Posted by Rhett Garber <rh...@gmail.com>.
Oh awesome. That's much better. Getting about 15 megs a minute now.

Rhett

On Wed, Feb 4, 2009 at 12:07 AM, Ulises <ul...@gmail.com> wrote:
>> Loading in the couchdb, i've only got 30 megs in the last hour. That
>> 30 megs has turned into 389 megs in the couchdb data file. That
>> doesn't seem like enough disk IO to cause this sort of delay.....
>> where is the time going ? network ?
>
> Are you uploading one document at a time or using bulk updates? You do
> this using update([doc1, doc2,...]) in couchdb-python.
>
> HTH,
>
> U
>

Re: data loading

Posted by Ulises <ul...@gmail.com>.
> Loading in the couchdb, i've only got 30 megs in the last hour. That
> 30 megs has turned into 389 megs in the couchdb data file. That
> doesn't seem like enough disk IO to cause this sort of delay.....
> where is the time going ? network ?

Are you uploading one document at a time or using bulk updates? You do
this using update([doc1, doc2,...]) in couchdb-python.

HTH,

U
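
Roughly what that batching looks like with couchdb-python; the database name,
file name, field name, and batch size below are placeholders:

    import couchdb  # python-couchdb

    db = couchdb.Server("http://127.0.0.1:5984/")["logs"]

    batch = []
    for line in open("app.log"):
        batch.append({"line": line.rstrip("\n")})
        if len(batch) >= 1000:
            db.update(batch)   # one bulk request instead of 1000 single PUTs
            batch = []
    if batch:
        db.update(batch)       # flush the final partial batch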

Re: data loading

Posted by Rhett Garber <rh...@gmail.com>.
Finally put a writeup of some of my findings up:
http://nullhole.com/2009/02/17/first-impressions-of-couchdb/

Any comments / further advice very welcome.

Rhett

On Tue, Feb 3, 2009 at 11:48 PM, Rhett Garber <rh...@gmail.com> wrote:
> I'm playing with couchdb for the first time. I'm trying to load up
> some data to play with but it seems REALLY slow.
>
> I'm using python-couchdb as the client.
>
> Are there any tips for getting data into couchdb in bulk ?
> It seems entirely IO bound at the moment, but it isn't clear to me why.
>
> I'm loading a log file of about 1.2 gigs... processing it in pure
> python is CPU bound and takes about 1.5 minutes to get through the
> whole thing.
>
> Loading in the couchdb, i've only got 30 megs in the last hour. That
> 30 megs has turned into 389 megs in the couchdb data file. That
> doesn't seem like enough disk IO to cause this sort of delay.....
> where is the time going ? network ?
>
> Rhett
>