Posted to user@couchdb.apache.org by Daniel Gonzalez <go...@gonvaled.com> on 2013/03/06 12:06:41 UTC

Update handler is very slow

Hi,

We have a problem in our data: we have been inconsistent with one of our
fields and have named it in different ways. Besides, in some places we have
used an int, in others a string. I have created an update handler to correct
this, and I am running it against our 100,000-document database by doing PUT
requests, as explained at
http://wiki.apache.org/couchdb/Document_Update_Handlers

What I am doing is:

   1. get affected documents with a view
   2. call the update handler.

And this is running over an ssh tunnel.
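
To give an idea, this is roughly what the setup looks like. The user_id
field, the design document and the handler names below are simplified
stand-ins, not our real schema (the handler unifies the field name and
coerces the value to an int):

import requests

COUCH = "http://localhost:5984/mydb"  # in our case this goes through the ssh tunnel

# One design document holding both pieces: a view that finds the broken
# documents, and the update handler (CouchDB's server-side JavaScript)
# that repairs a single document per PUT.
design_doc = {
    "_id": "_design/fixup",
    "views": {
        "affected": {
            "map": """function(doc) {
                if (doc.userid || doc.userId || typeof doc.user_id === 'string')
                    emit(doc._id, null);    // emit very little: just the id
            }"""
        }
    },
    "updates": {
        "normalize": """function(doc, req) {
            if (!doc) return [null, 'missing'];
            var v = doc.user_id || doc.userid || doc.userId;
            delete doc.userid;
            delete doc.userId;
            doc.user_id = parseInt(v, 10);  // one name, one type
            return [doc, 'fixed'];
        }"""
    },
}
requests.put(COUCH + "/_design/fixup", json=design_doc)

# 1. get the affected documents with the view
rows = requests.get(COUCH + "/_design/fixup/_view/affected").json()["rows"]

# 2. call the update handler -- one PUT per document
for row in rows:
    requests.put(COUCH + "/_design/fixup/_update/normalize/" + row["id"])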

My problem is that this is very slow: currently I am running at 4 docs/s.
Is this normal?

I could do this locally (no ssh tunnel), but I guess things would not
improve much, since the amount of data being transferred is small (no
include_docs, and the view emits very little information). I have the
impression that the bottleneck is CouchDB itself: the update handler is
just that slow.

Am I right about this? Is there a way to speed this up?

Thanks,
Daniel

Re: Update handler is very slow

Posted by Robert Newson <rn...@apache.org>.
I bet you could go faster too, but that's a huge improvement, congrats!

On 6 March 2013 08:21, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I couldn't resist and I have moved to a bulk read / modify / bulk write
> approach and the situation has dramatically improved: I am running now at
> over 100 docs/s compared to a 4 docs/s with the update handler.
>
> On Wed, Mar 6, 2013 at 2:28 PM, Daniel Gonzalez <go...@gonvaled.com>wrote:
>
>> Thanks Robert, that explains it.
>>
>> I was indeed under the impression that update handlers are faster than
>> re-creation of documents. Seeing couchdb as a black-box, that is what you
>> would expect, since the update handler requires less information transfer,
>> and is largely performed inside couchdb itself (with eventually some data
>> coming with the http request).
>>
>> I understand now that the implementation details of the update handler
>> make it slower (in the general case) than re-creation of documents, but
>> since this is not plainly obvious, I think it should be mentioned in the
>> documentation about update handlers.
>>
>> Actually, my first approach to solve the problem was to do exactly that
>> (bulk read / modify / bulk write), but I discarded it because I had thought
>> that an update handler would be *faster*. Then I implemented my solution,
>> and was surprised about the slowness of it. Hence my mail.
>>
>> Now my database update is halfway through, and I will let it run until
>> completion. For the next time, I hope to remember about this discussion.
>>
>> Thanks,
>> Daniel
>>
>> On Wed, Mar 6, 2013 at 2:17 PM, Robert Newson <rn...@apache.org> wrote:
>>
>>> Update handlers are very slow in comparison to a straight POST or PUT
>>> as they have to invoke some Javascript on the server. This is, by some
>>> margin, the slowest way to achieve your goal.
>>>
>>> The mistake here, though, is thinking that an update handler is the
>>> right way to update every document in your system. Update handlers
>>> exist to add a little server-side logic in cases where it's impossible
>>> or awkward to do so in the client (i.e, when the client is not a
>>> browser). GIven their intrinsic slowness, I'd avoid them where I
>>> could.
>>>
>>> The fastest way to update documents is to use the bulk document API.
>>> Ideally you want fetch a batch of docs that need updating in one call,
>>> transform them using any scripting language or tool, and then update
>>> the batch by posting it to _bulk_docs. These methods are described in
>>> http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API. Some
>>> experimentation will be required to find a good batch size; too small
>>> and this will take longer than it could, too high and the server can
>>> crash by running out of memory.  Unless your documents are very large,
>>> or very small, I'd start with a couple of hundred docs and then tweak
>>> up and down. Since this sounds like a one-off, you might even skip
>>> this optimization phase, the difference between doing singular PUT's
>>> through an update handler and doing 200 documents through _bulk_docs
>>> will be so huge that you might not need it to go any faster.
>>>
>>> There was a recent thread to add this as a CouchDB feature. If we did,
>>> it would work much the same as above. I'm wary, though, as it would
>>> encourage the rewrite-all-the-documents approach. That should be quite
>>> a rare event since a schema-less document-oriented approach should
>>> largely relieve you of the pain of changing document contents. In this
>>> thread's case, the inconsistent use of a particular field, a one-time
>>> fix-up makes sense (assuming that new updates are consistent).
>>>
>>> B.
>>>
>>>
>>> On 6 March 2013 06:13, Anthony Ananich <an...@inpun.com> wrote:
>>> > And how much does it take to add document by HTTP PUT?
>>> >
>>> > On Wed, Mar 6, 2013 at 2:33 PM, svilen <az...@svilendobrev.com> wrote:
>>> >> +1. i'd like to know also about update_handlers as i may get into such
>>> >> situation soon.
>>> >>
>>> >> not an answer:
>>> >> if you sure your transformation is correct, my lame take would be:
>>> >> don't do anything.
>>> >> 4doc/s, 12000/hour - so by tomorrow it would be done.
>>> >>
>>> >> of course, no harm to find/learn - e.g. u may need to rerun it again..
>>> >>
>>> >> ciao
>>> >> svilen
>>> >>
>>> >> On Wed, 6 Mar 2013 12:06:41 +0100
>>> >> Daniel Gonzalez <go...@gonvaled.com> wrote:
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>> We have a problem in our data: we have been inconsistent in one of our
>>> >>> fields, and we have named it in different ways. Besides, in some
>>> >>> places we have used int, in other places string. I have created an
>>> >>> update handler to correct this situation, and I am running it for our
>>> >>> 100 thousand documents database, by doing PUT requests, as explained
>>> >>> http://wiki.apache.org/couchdb/Document_Update_Handlers
>>> >>>
>>> >>> What I am doing is:
>>> >>>
>>> >>>    1. get affected documents with a view
>>> >>>    2. call the update handler.
>>> >>>
>>> >>> And this is running over an ssh tunnel.
>>> >>>
>>> >>> My problem is that this is veeeery slow. Currently I am running at 4
>>> >>> docs/s. Is this normal?
>>> >>>
>>> >>> I could do this locally (no ssh tunnel), but I guess things would not
>>> >>> improve much, since the data being transferred is not that big (no
>>> >>> include_docs, and the view emits very litte information). I have the
>>> >>> impression that the bottleneck is couchdb itself: the update handler
>>> >>> is just that slow.
>>> >>>
>>> >>> Am I right about this? Is there a way to speed this up?
>>> >>>
>>> >>> Thanks,
>>> >>> Daniel
>>>
>>
>>

Re: Update handler is very slow

Posted by Daniel Gonzalez <go...@gonvaled.com>.
I couldn't resist, and I have moved to a bulk read / modify / bulk write
approach. The situation has improved dramatically: I am now running at over
100 docs/s, compared to 4 docs/s with the update handler.
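
In case it is useful to someone, the loop is essentially the following
(same made-up user_id field and affected view as the sketch in my first
mail; the real code has more error handling):

import requests

COUCH = "http://localhost:5984/mydb"   # same database as in my first mail
BATCH = 200                            # batch size; worth tuning

def process_batch(size):
    """One bulk read / modify / bulk write round trip.

    Returns the number of documents rewritten. 0 means the 'affected'
    view is empty, i.e. the fixup is finished (repaired docs stop being
    emitted by the view, so no paging bookkeeping is needed).
    """
    rows = requests.get(COUCH + "/_design/fixup/_view/affected",
                        params={"limit": size}).json()["rows"]
    if not rows:
        return 0

    # bulk read: one POST fetches the full documents for the whole batch
    fetched = requests.post(COUCH + "/_all_docs",
                            params={"include_docs": "true"},
                            json={"keys": [r["id"] for r in rows]}).json()["rows"]

    # modify locally: collapse the made-up field variants into one int field
    docs = []
    for r in fetched:
        doc = r["doc"]
        value = doc.pop("userid", None) or doc.pop("userId", None) or doc.get("user_id")
        doc["user_id"] = int(value)
        docs.append(doc)

    # bulk write: one POST saves the whole batch
    requests.post(COUCH + "/_bulk_docs", json={"docs": docs})
    return len(docs)

while process_batch(BATCH):
    pass

Note that _bulk_docs is not atomic: its response lists each document with
either a new revision or a conflict, so a careful run should check it
rather than fire and forget.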

Re: Update handler is very slow

Posted by Daniel Gonzalez <go...@gonvaled.com>.
Thanks Robert, that explains it.

I was indeed under the impression that update handlers are faster than
re-creating documents. Treating CouchDB as a black box, that is what you
would expect, since an update handler requires less data transfer and runs
largely inside CouchDB itself (with perhaps some data coming along with the
HTTP request).

I understand now that the implementation details of update handlers make
them slower (in the general case) than re-creating documents, but since
this is not at all obvious, I think it should be mentioned in the update
handler documentation.

Actually, my first approach to the problem was to do exactly that (bulk
read / modify / bulk write), but I discarded it because I thought that an
update handler would be *faster*. Then I implemented my solution and was
surprised by how slow it was. Hence my mail.

Now my database update is halfway through, and I will let it run to
completion. Next time, I hope to remember this discussion.

Thanks,
Daniel

Re: Update handler is very slow

Posted by Robert Newson <rn...@apache.org>.
Update handlers are very slow in comparison to a straight POST or PUT,
as they have to invoke some JavaScript on the server. This is, by some
margin, the slowest way to achieve your goal.

The mistake here, though, is thinking that an update handler is the
right way to update every document in your system. Update handlers exist
to add a little server-side logic in cases where it's impossible or
awkward to do so in the client (i.e., when the client is not a browser).
Given their intrinsic slowness, I'd avoid them where I could.

The fastest way to update documents is to use the bulk document API.
Ideally you want to fetch a batch of docs that need updating in one call,
transform them using any scripting language or tool, and then update the
batch by posting it to _bulk_docs. These methods are described at
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API. Some
experimentation will be required to find a good batch size: too small
and this will take longer than it could; too large and the server can
crash by running out of memory. Unless your documents are very large,
or very small, I'd start with a couple of hundred docs and then tweak
up and down. Since this sounds like a one-off, you might even skip this
optimization phase; the difference between doing single PUTs through an
update handler and pushing 200 documents at a time through _bulk_docs
will be so huge that you might not need it to go any faster.
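
If you would rather measure than guess, timing one real batch at each
candidate size is enough. A rough sketch, assuming a hypothetical
process_batch(size) function that does one bulk fetch / transform /
_bulk_docs round trip and returns how many docs it rewrote:

import time

def docs_per_second(process_batch, size):
    """Time one real fixup batch at the given size; returns docs/second."""
    start = time.time()
    count = process_batch(size)  # one bulk fetch / transform / _bulk_docs round trip
    return count / (time.time() - start)

# Hypothetical usage, probing a few sizes (the batches do real fixup
# work, so nothing is wasted):
#
#   for size in (50, 200, 1000):
#       print(size, round(docs_per_second(process_batch, size)))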

There was a recent thread about adding this as a CouchDB feature. If we
did, it would work much the same as above. I'm wary, though, as it would
encourage the rewrite-all-the-documents approach. That should be quite a
rare event, since a schema-less, document-oriented approach should
largely relieve you of the pain of changing document contents. In this
thread's case (the inconsistent use of a particular field), a one-time
fix-up makes sense, assuming that new updates are consistent.

B.

Re: Update handler is very slow

Posted by Anthony Ananich <an...@inpun.com>.
And how long does it take to add a document with a plain HTTP PUT?

Re: Update handler is very slow

Posted by svilen <az...@svilendobrev.com>.
+1. I'd like to know about update handlers too, as I may get into such a
situation soon.

Not an answer, but if you are sure your transformation is correct, my lame
take would be: don't do anything. At 4 docs/s you are doing about 14,400
docs/hour, so by tomorrow it would be done.

Of course, there's no harm in finding out, e.g. you may need to rerun it
again.

ciao
svilen
