Posted to dev@couchdb.apache.org by Benoit Chesneau <bc...@gmail.com> on 2012/07/09 15:10:49 UTC

couch_query_server refactoring

I'm working on the couch_query_server refactoring:
    - extract it from the couch app
    - introduce a generic way to add query servers, whether they are
      written in Erlang or call OS processes like couchjs (so instead of
      spawning distinct OS processes, a native query server would simply
      call an Erlang module with some arguments; see the sketch below)
    - split the couchapp engine into its own module.
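
Roughly, that generic layer could be a small behaviour that both native
Erlang query servers and OS-process wrappers implement. A minimal sketch,
with hypothetical module and callback names (none of this is existing
CouchDB code):

    %% Hypothetical sketch: a query server backend is anything that can
    %% evaluate design doc functions, whether it is a native Erlang module
    %% or a wrapper around an OS process such as couchjs.
    -module(couch_query_backend).

    -callback start_link(Config :: [{atom(), term()}]) ->
        {ok, pid()} | {error, term()}.
    -callback prompt(Pid :: pid(), Request :: [term()]) ->
        {ok, term()} | {error, term()}.
    -callback stop(Pid :: pid()) -> ok.

    -export([prompt/2]).

    %% Callers hold a {Module, Pid} handle and never need to know whether
    %% the backend is native Erlang or an external OS process.
    prompt({Mod, Pid}, Request) ->
        Mod:prompt(Pid, Request).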

I'm actually wondering why you have one proc per ddoc. Is there any
reason for that, apart from the rewriter?

- benoît

Re: couch_query_server refactoring

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Jul 9, 2012 at 9:59 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Mon, Jul 9, 2012 at 3:59 PM, Paul Davis <pa...@gmail.com> wrote:
>> Benoît,
>>
>> Lately I've been contemplating removing a lot of the Erlang mechanics
>> for this by rewriting couchjs as a single-process, multi-threaded
>> application. I've seen a lot of issues related to our process handling
>> and I also think we can probably speed things up considerably if we
>> change how this works. Ie, if we move to an asynchronous message
>> passing interface instead of the serialized stdio interface we should
>> be able to get some nice speedups in throughput while also removing a
>> lot of the resource usage.
>
> What would be the difference/advantage compared to using something like emonk?

Error handling. With emonk, if you crash and burn it'll take out the
entire VM. Granted, if we manage this sufficiently well we should be
able to swap them out interchangeably for people who want to run
emonk or whatever.

>>
>> As part of that we should also do what you suggest and look into
>> refactoring the top layer to make this stuff a lot cleaner where we
>> call it in places like the rewriter and what not.
>>
>> I'm also not sure what you mean about the couchapp module. Right now
>> if I had to guess I could see a couple Erlang apps: one that
>> encompasses couchjs for JS code, one for Erlang code (for the
>> view/list/show etc) etc etc. I could also see having the
>> rewriter/list/show stuff in its own app as well but it's early and I'm
>> not quite awake yet.
>
> I'm thinking of having a new structure like:
>
> {src|apps}/
>     couch_qs -> everything really needed for indexing / map-reduce
>     couch_ape -> everything related to the couchapps

Hard to say. Seems like we could be a bit more granular but I'd have
to see the end result of the refactoring to be sure. Either way, not a
decision we have to make right now.

>
>>
>>
>> On Mon, Jul 9, 2012 at 8:10 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>> I'm working on the couch_query_server refactoring:
>>>     - extract it from the couch app
>>>     - introduce a generic way to add query servers, whether they are
>>>       written in Erlang or call OS processes like couchjs (so instead of
>>>       spawning distinct OS processes, a native query server would simply
>>>       call an Erlang module with some arguments)
>>>     - split the couchapp engine into its own module.
>>>
>>> I'm actually wondering why you have one proc per ddoc. Is there any
>>> reason for that, apart from the rewriter?
>>>
>>> - benoît

Re: couch_query_server refactoring

Posted by Benoit Chesneau <bc...@gmail.com>.
On Mon, Jul 9, 2012 at 3:59 PM, Paul Davis <pa...@gmail.com> wrote:
> Benoît,
>
> Lately I've been contemplating removing a lot of the Erlang mechanics
> for this by rewriting couchjs as a single-process, multi-threaded
> application. I've seen a lot of issues related to our process handling
> and I also think we can probably speed things up considerably if we
> change how this works. Ie, if we move to an asynchronous message
> passing interface instead of the serialized stdio interface we should
> be able to get some nice speedups in throughput while also removing a
> lot of the resource usage.

What would be the difference/advantage compared to using something like emonk?
>
> As part of that we should also do what you suggest and look into
> refactoring the top layer to make this stuff a lot cleaner where we
> call it in places like the rewriter and what not.
>
> I'm also not sure what you mean about the couchapp module. Right now
> if I had to guess I could see a couple Erlang apps: one that
> encompasses couchjs for JS code, one for Erlang code (for the
> view/list/show etc) etc etc. I could also see having the
> rewriter/list/show stuff in its own app as well but it's early and I'm
> not quite awake yet.

I'm thinking of having a new structure like:

{src|apps}/
    couch_qs -> everything really needed for indexing / map-reduce
    couch_ape -> everything related to the couchapps

>
>
> On Mon, Jul 9, 2012 at 8:10 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>> I'm working on the couch_query_server refactoring:
>>     - extract it from the couch app
>>     - introduce a generic way to add query servers, whether they are
>>       written in Erlang or call OS processes like couchjs (so instead of
>>       spawning distinct OS processes, a native query server would simply
>>       call an Erlang module with some arguments)
>>     - split the couchapp engine into its own module.
>>
>> I'm actually wondering why you have one proc per ddoc. Is there any
>> reason for that, apart from the rewriter?
>>
>> - benoît

Re: couch_query_server refactoring

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Jul 12, 2012 at 2:53 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Thu, Jul 12, 2012 at 9:23 AM, Paul Davis <pa...@gmail.com> wrote:
>> I actually see one of two ways. And which one we go with depends on
>> an odd question that I've never really asked (or seen discussed)
>> before.
>>
>> Both patterns would have a single couchjs process that is threaded and
>> can respond to multiple requests simultaneously. Doing this gives us
>> the ability to maximize throughput using pipelining and the like. The
>> downside is that if the couchjs process dies then it affects every
>> other message that was in transit through it. Although I think we can
>> mitigate a lot of this with some basic retry/exponential back off
>> logic.
>>
>> One way we can do this is to use a similar approach to what we do now
>> but asynchronously. Ie, messages from Erlang to couchjs become tagged
>> tuples that a central process dispatches back and forth to clients.
>> There are a few issues here. Firstly, we're going to basically need
>> two caches of lookups which could be harmful to low latency
>> requirements. Ie, Erlang will have to keep a cache of tag/client
>> pairs, and the couchjs side will have to have some sort of lookup
>> table/cache thing for ddoc ids to JS_Contexts. The weird question I
>> have about this is, "What is the latency/bandwidth of stdio?" I've
>> never thought to try and benchmark such a thing but it seems possible
>> (no idea how likely though) that we could max out the capacity of a
>> single file descriptor for communication.
>>
>> The second approach is to turn couchjs into a simple threaded network
>> server. This way the Erlang side would just open a socket per design
>> doc/context pair, and the couchjs side would be rather simple to
>> implement (considering the alternative). I like this approach because
>> it minimizes cache/lookups, uses more file descriptors (not sure if
>> this is a valid concern), and (most importantly) keeps most of the
>> complexity in Erlang where it's easier to manage.
>
>
> I quite like this second approach too. It would ease any
> implementation and is also really portable. Ie passing a socket path
> or port to an external server is easier in that case. For example, on
> iOS the app accessing CouchDB could maintain this process
> internally and wouldn't break any sandbox. It may be easier than writing
> a NIF for each case...
>

Not sure what you mean by passing a socket path or port, but I think I
agree that it would map cleanly onto an emonk replacement.

>>
>> Either way, once either of those things is implemented we can
>> re-implement our external process management to not totally bork
>> servers under load as well as (hopefully) improve latency considerably
>> for any of our "RPC" things.
>
> You mean the pool of couch_os_processes?
>
> - benoît

I meant that, in general, replacing that pool will help anything that needs
to make a JavaScript function call.

Re: couch_query_server refactoring

Posted by Benoit Chesneau <bc...@gmail.com>.
On Thu, Jul 12, 2012 at 9:23 AM, Paul Davis <pa...@gmail.com> wrote:
> I actually see one of two ways. And which one we go with depends on
> an odd question that I've never really asked (or seen discussed)
> before.
>
> Both patterns would have a single couchjs process that is threaded and
> can respond to multiple requests simultaneously. Doing this gives us
> the ability to maximize throughput using pipelining and the like. The
> downside is that if the couchjs process dies then it affects every
> other message that was in transit through it. Although I think we can
> mitigate a lot of this with some basic retry/exponential back off
> logic.
>
> One way we can do this is to use a similar approach to what we do now
> but asynchronously. Ie, messages from Erlang to couchjs become tagged
> tuples that a central process dispatches back and forth to clients.
> There are a few issues here. Firstly, we're going to basically need
> two caches of lookups which could be harmful to low latency
> requirements. Ie, Erlang will have to keep a cache of tag/client
> pairs, and the couchjs side will have to have some sort of lookup
> table/cache thing for ddoc ids to JS_Contexts. The weird question I
> have about this is, "What is the latency/bandwidth of stdio?" I've
> never thought to try and benchmark such a thing but it seems possible
> (no idea how likely though) that we could max out the capacity of a
> single file descriptor for communication.
>
> The second approach is to turn couchjs into a simple threaded network
> server. This way the Erlang side would just open a socket per design
> doc/context pair, and the couchjs side would be rather simple to
> implement (considering the alternative). I like this approach because
> it minimizes cache/lookups, uses more file descriptors (not sure if
> this is a valid concern), and (most importantly) keeps most of the
> complexity in Erlang where it's easier to manage.


I quite like this second approach too. It would ease any
implementation and is also really portable. Ie passing a socket path
or port to an external server is easier in that case. For example, on
iOS the app accessing CouchDB could maintain this process
internally and wouldn't break any sandbox. It may be easier than writing
a NIF for each case...
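
For instance, the Erlang side could just read the address of an externally
managed query server from the ini config and connect to it, instead of
spawning the process itself. The section, keys, and port below are made up
for illustration:

    %% Hypothetical config keys: point CouchDB at a couchjs server that
    %% something else (e.g. the embedding iOS app) is already running.
    Host = couch_config:get("query_servers", "javascript_host", "127.0.0.1"),
    Port = list_to_integer(
               couch_config:get("query_servers", "javascript_port", "5985")),
    {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {packet, line}]).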

>
> Either way, once either of those things is implemented we can
> re-implement our external process management to not totally bork
> servers under load as well as (hopefully) improve latency considerably
> for any of our "RPC" things.

You mean the pool of couch_os_processes?

- benoît

Re: couch_query_server refactoring

Posted by Paul Davis <pa...@gmail.com>.
I actually see one of two ways. And which one we go with depends on
an odd question that I've never really asked (or seen discussed)
before.

Both patterns would have a single couchjs process that is threaded and
can respond to multiple requests simultaneously. Doing this gives us
the ability to maximize throughput using pipelining and the like. The
downside is that if the couchjs process dies then it affects every
other message that was in transit through it. Although I think we can
mitigate a lot of this with some basic retry/exponential back off
logic.
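
As an illustration of the retry idea, a tiny hypothetical helper (a sketch,
not existing code; it would live in whatever module drives the couchjs
calls) could look like:

    %% Hypothetical helper: retry a couchjs request with exponential
    %% backoff, giving up after MaxAttempts tries.
    retry(Fun, MaxAttempts) ->
        retry(Fun, 0, MaxAttempts).

    retry(_Fun, Attempt, MaxAttempts) when Attempt >= MaxAttempts ->
        {error, retries_exhausted};
    retry(Fun, Attempt, MaxAttempts) ->
        try
            Fun()
        catch
            _:_ ->
                %% back off 100ms, 200ms, 400ms, ... between attempts
                timer:sleep(100 bsl Attempt),
                retry(Fun, Attempt + 1, MaxAttempts)
        end.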

One way we can do this is to use a similar approach to what we do now
but asynchronously. Ie, messages from Erlang to couchjs become tagged
tuples that a central process dispatches back and forth to clients.
There are a few issues here. Firstly, we're going to basically need
two caches of lookups which could be harmful to low latency
requirements. Ie, Erlang will have to keep a cache of tag/client
pairs, and the couchjs side will have to have some sort of lookup
table/cache thing for ddoc ids to JS_Contexts. The weird question I
have about this is, "What is the latency/bandwidth of stdio?" I've
never thought to try and benchmark such a thing but it seems possible
(no idea how likely though) that we could max out the capacity of a
single file descriptor for communication.
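
To make that concrete, here is a rough sketch of the Erlang-side dispatcher.
Everything here is hypothetical: the module name, the use of jiffy for JSON,
and the one-JSON-array-per-line [Tag, Request] wire format are all invented
for illustration:

    -module(couchjs_dispatch).
    -behaviour(gen_server).

    -export([start_link/1, prompt/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    %% One dispatcher owns the couchjs port.
    start_link(CouchJsCmd) ->
        gen_server:start_link(?MODULE, CouchJsCmd, []).

    %% Callers block here, but the dispatcher never blocks on couchjs.
    prompt(Pid, Request) ->
        gen_server:call(Pid, {prompt, Request}, infinity).

    init(CouchJsCmd) ->
        Port = open_port({spawn, CouchJsCmd}, [binary, {line, 65536}]),
        {ok, #{port => Port, pending => #{}, next_tag => 0}}.

    %% Tag the request, remember the caller, and fire it off without waiting.
    handle_call({prompt, Request}, From, State) ->
        #{port := Port, pending := Pending, next_tag := Tag} = State,
        port_command(Port, [jiffy:encode([Tag, Request]), $\n]),
        {noreply, State#{pending := maps:put(Tag, From, Pending),
                         next_tag := Tag + 1}}.

    %% Responses come back as [Tag, Result]; route each one to its caller.
    handle_info({Port, {data, {eol, Line}}}, #{port := Port} = State) ->
        #{pending := Pending} = State,
        [Tag, Result] = jiffy:decode(Line),
        gen_server:reply(maps:get(Tag, Pending), {ok, Result}),
        {noreply, State#{pending := maps:remove(Tag, Pending)}};
    handle_info(_Other, State) ->
        {noreply, State}.

    handle_cast(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.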

The second approach is to turn couchjs into a simple threaded network
server. This way the Erlang side would just open a socket per design
doc/context pair, and the couchjs side would be rather simple to
implement (considering the alternative). I like this approach because
it minimizes cache/lookups, uses more file descriptors (not sure if
this is a valid concern), and (most importantly) keeps most of the
complexity in Erlang where it's easier to manage.
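
And the corresponding sketch for the second approach, with one TCP connection
per ddoc/context. Again hypothetical: the module name, jiffy, and the
newline-delimited JSON protocol are assumptions, not the actual couchjs
protocol:

    %% Hypothetical sketch: one TCP connection per design doc / JS context.
    %% Assumes couchjs has become a threaded network server speaking
    %% newline-delimited JSON.
    -module(couchjs_socket).

    -export([connect/3, prompt/2]).

    connect(Host, Port, DDocId) ->
        {ok, Sock} = gen_tcp:connect(Host, Port,
                                     [binary, {packet, line}, {active, false}]),
        %% Bind this connection to one design doc so couchjs can keep a
        %% dedicated JS context for it.
        ok = gen_tcp:send(Sock, [jiffy:encode([<<"ddoc">>, DDocId]), $\n]),
        {ok, Sock}.

    prompt(Sock, Request) ->
        ok = gen_tcp:send(Sock, [jiffy:encode(Request), $\n]),
        case gen_tcp:recv(Sock, 0, 5000) of
            {ok, Line} ->
                %% strip the trailing newline before decoding
                {ok, jiffy:decode(string:trim(Line))};
            {error, Reason} ->
                {error, Reason}
        end.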

Either way, once either of those things is implemented we can
re-implement our external process management to not totally bork
servers under load as well as (hopefully) improve latency considerably
for any of our "RPC" things.

On Thu, Jul 12, 2012 at 1:57 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Mon, Jul 9, 2012 at 3:59 PM, Paul Davis <pa...@gmail.com> wrote:
>> Benoît,
>>
>> Lately I've been contemplating removing a lot of the Erlang mechanics
>> for this by rewriting couchjs as a single-process, multi-threaded
>> application. I've seen a lot of issues related to our process handling
>> and I also think we can probably speed things up considerably if we
>> change how this works. Ie, if we move to an asynchronous message
>> passing interface instead of the serialized stdio interface we should
>> be able to get some nice speedups in throughput while also removing a
>> lot of the resource usage.
>>
> Do you have a general idea of the flow? Or an example? I can be
> funded to do that work :)
>
> - benoît

Re: couch_query_server refactoring

Posted by Benoit Chesneau <bc...@gmail.com>.
On Mon, Jul 9, 2012 at 3:59 PM, Paul Davis <pa...@gmail.com> wrote:
> Benoît,
>
> Lately I've been contemplating removing a lot of the Erlang mechanics
> for this by rewriting couchjs as a single-process, multi-threaded
> application. I've seen a lot of issues related to our process handling
> and I also think we can probably speed things up considerably if we
> change how this works. Ie, if we move to an asynchronous message
> passing interface instead of the serialized stdio interface we should
> be able to get some nice speedups in throughput while also removing a
> lot of the resource usage.
>
Do you have a general idea of the flow? Or an example? I can be
funded to do that work :)

- benoît

Re: couch_query_server refactoring

Posted by Paul Davis <pa...@gmail.com>.
Benoît,

Lately I've been contemplating removing a lot of the Erlang mechanics
for this by rewriting couchjs as a single-process, multi-threaded
application. I've seen a lot of issues related to our process handling
and I also think we can probably speed things up considerably if we
change how this works. Ie, if we move to an asynchronous message
passing interface instead of the serialized stdio interface we should
be able to get some nice speedups in throughput while also removing a
lot of the resource usage.

As part of that we should also do what you suggest and look into
refactoring the top layer to make this stuff a lot cleaner where we
call it in places like the rewriter and what not.

I'm also not sure what you mean about the couchapp module. Right now
if I had to guess I could see a couple Erlang apps: one that
encompasses couchjs for JS code, one for Erlang code (for the
view/list/show etc) etc etc. I could also see having the
rewriter/list/show stuff in its own app as well but it's early and I'm
not quite awake yet.


On Mon, Jul 9, 2012 at 8:10 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> I'm working on the couch_query_server refactoring:
>     - extract it from the couch app
>     - introduce a generic way to add query servers, whether they are
>       written in Erlang or call OS processes like couchjs (so instead of
>       spawning distinct OS processes, a native query server would simply
>       call an Erlang module with some arguments)
>     - split the couchapp engine into its own module.
>
> I'm actually wondering why you have one proc per ddoc. Is there any
> reason for that, apart from the rewriter?
>
> - benoît