Posted to dev@couchdb.apache.org by Jan Lehnardt <ja...@apache.org> on 2013/08/15 11:38:19 UTC

Re: Erlang vs JavaScript

On Aug 15, 2013, at 10:09 , Robert Newson <rn...@apache.org> wrote:

> A big +1 to Jason's clarification of "erlang" vs "native". CouchDB
> could have shipped an erlang view server that worked in a separate
> process and had the stdio overhead, to combine the slowness of the
> protocol with the obtuseness of erlang. ;)
> 
> Evaluating Javascript within the erlang VM process intrigues me, Jens,
> how is that done in your case? I've not previously found the assertion
> that V8 would be faster than SpiderMonkey for a view server compelling
> since the bottleneck is almost never in the code evaluation, but I do
> support CouchDB switching to it for the synergy effects of a closer
> binding with node.js, but if it's running in the same process, that
> would change (though I don't immediately see why the same couldn't be
> done for SpiderMonkey). Off the top of my head, I don't know a safe
> way to evaluate JS in the VM. A NIF-based approach would either be
> quite elaborate or would trip all the scheduling problems that
> long-running NIF's are now notorious for.
> 
> At a step removed, the view server protocol itself seems like the
> thing to improve on, it feels like that's the principal bottleneck.

The code is here: https://github.com/couchbase/couchdb/tree/master/src/mapreduce

I’d love for someone to pick this up and give CouchDB, say, a ./configure --enable-native-v8 option or a plugin that allows people to opt into the speed improvements made there. :)

The choice of V8 was made because of its easier integration API and its more reliable releases as a standalone project, which I think was a smart move.

IIRC it relies on a change to CouchDB-y internals that has not made it back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading this thread), but we should look into that and get us “native JS views”, at least as an option or plugin.

CCing dev@.

Jan
--





> 
> B.
> 
> 
> On 15 August 2013 08:22, Jason Smith <jh...@apache.org> wrote:
>> Yes, to a first approximation, with a native view, CouchDB is basically
>> running eval() on your code. In my example, I took advantage of this to
>> build a nonstandard response to satisfy an application. (Instead of a 404,
>> we sent a designated fallback document body.)
>> 
>> But, if you accumulate the list in a native view, a JavaScript view, or a
>> hypothetical Erlang view (i.e. a subprocess), from the operating system's
>> perspective, the memory for that list will be allocated somewhere. Either
>> the CouchDB process asks for X KB more memory, or its subprocess will ask
>> for it. So I think the total system impact is probably low in practice.
>> 
>> So I guess my point is not that native views are wrong, just they have a
>> cost so you should weigh the cost/benefit for your own project. In the case
>> of manage_couchdb, I wrote a JavaScript implementation; but since sometimes
>> I have an emergency and I must find conflicts ASAP, I made an Erlang
>> version because it is worth it.
>> 
>> 
>> On Thu, Aug 15, 2013 at 2:05 PM, Stanley Iriele <si...@gmail.com>wrote:
>> 
>>> Whoa...OK...that I had no idea about...thanks for taking the time to go to
>>> that granularity, by the way.
>>> 
>>> So does this mean that the process memory is shared? As opposed to living
>>> in its own space? So if someone accumulates a large JSON object in a list
>>> function, it's chewing up CouchDB's memory?... I guess I'm a little confused
>>> about what's in the same process and what isn't now
>>> On Aug 14, 2013 11:57 PM, "Jason Smith" <jh...@apache.org> wrote:
>>> 
>>>> To me, an Erlang view is a view server which supports map, reduce, show,
>>>> update, list, etc. functions in the Erlang language. (Basically it is
>>>> implemented in Erlang.)
>>>> 
>>>> A view server is a subprocess that runs beneath CouchDB which
>>> communicates
>>>> with it over standard i/o. It is a different process in the operating
>>>> system and only interfaces with the main server using the view server
>>>> protocol (basically a bunch of JSON messages going back and forth).
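For illustration, a rough sketch of the line-delimited JSON that flows over that stdio channel during a view build (the shapes are approximate; the query server protocol documentation has the exact forms):

["reset", {"reduce_limit": true}]
   <- true
["add_fun", "function(doc) { emit(doc._id, null); }"]
   <- true
["map_doc", {"_id": "doc1", "type": "post", "title": "hello"}]
   <- [[["doc1", null]]]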
>>>> 
>>>> I do not know of an Erlang view server which works well and is currently
>>>> maintained.
>>>> 
>>>> A native view (shipped by CouchDB but disabled by default) is some
>>>> corner-cutting. Code is evaluated directly by the primary CouchDB server.
>>>> Since CouchDB is Erlang, the native query server is necessarily Erlang.
>>> The
>>>> key difference is, your code is right there in the eye of the storm. You
>>>> can call couch_server:open("some_db") and completely circumvent security
>>>> and other invariants which CouchDB enforces. You can leak memory until
>>> the
>>>> kernel OOM killer terminates CouchDB. It's not about the language, it's
>>>> that it is running inside the CouchDB process.
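For reference, this is the switch being described: in the 1.x local.ini the native query server ships commented out, and enabling it looks roughly like this (check your own config before copying):

[native_query_servers]
erlang = {couch_native_process, start_link, []}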
>>>> 
>>>> 
>>>> 
>>>> On Thu, Aug 15, 2013 at 1:36 PM, Stanley Iriele <siriele2x3@gmail.com
>>>>> wrote:
>>>> 
>>>>> Wait....I'm a tad confused here..Jason what is the difference between
>>>>> native views and Erlang views?...
>>>>> On Aug 14, 2013 11:16 PM, "Jason Smith" <jh...@apache.org> wrote:
>>>>> 
>>>>>> Oh, also:
>>>>>> 
>>>>>> They are **not** Erlang views. They are **native** views. We should
>>>>>> emphasize the latter to remind ourselves about the security and
>>>>> reliability
>>>>>> risks which Bob identifies.
>>>>>> 
>>>>>> They are very powerful, but it is a trade-off. Once I had a customer
>>>> who
>>>>>> had a basic "class" document describing common values. All other
>>>>> documents
>>>>>> were for modifications to the "base class" so to speak. He needed to
>>>>> query
>>>>>> by document ID, but if no such document existed, return the "base
>>>> class"
>>>>>> document instead. The product was already in the field and so the
>>> code
>>>>>> could not change. We had to change it in CouchDB.
>>>>>> 
>>>>>> The fix was very simple: a _rewrite rule to a native _show function.
>>> In
>>>>> the
>>>>>> show function, if the Doc was null, then we used the internal CouchDB
>>>> API
>>>>>> to fetch the default document. Voila.
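To make the shape concrete, here is a hypothetical version of that setup; the route, design doc name and fallback payload are invented, and the real fix used a native (Erlang) show function, since a JavaScript show function cannot fetch another document:

{
  "rewrites": [
    {"from": "/class/:docid", "to": "_show/class_or_base/:docid"}
  ]
}

// Show functions receive null when the requested doc does not exist;
// the native version fetched the "base class" document here instead.
function(doc, req) {
  if (doc === null) {
    return {code: 200, json: {note: "base class body would go here"}};
  }
  return {json: doc};
}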
>>>>>> 
>>>>>> 
>>>>>> On Thu, Aug 15, 2013 at 1:08 PM, Jason Smith <jh...@apache.org> wrote:
>>>>>> 
>>>>>>> On Thursday, August 15, 2013, Andrey Kuprianov wrote:
>>>>>>> 
>>>>>>>> Doesn't server performance degrade while views are being
>>> rebuilt?
>>>> So
>>>>>> the
>>>>>>>> faster they are rebuilt, the better for you.
>>>>>>> 
>>>>>>> 
>>>>>>> If my view build would degrade total performance to cross an
>>>>> unacceptable
>>>>>>> threshold, then I am really riding the line! What about an
>>> unplanned
>>>>>>> compaction? What if one day the clients have a bug and load
>>>> increases?
>>>>>> What
>>>>>>> if an unplanned disaster happens and a backup must be performed
>>>>> urgently?
>>>>>>> 
>>>>>>> I would evaluate view performance in the larger context of the
>>> entire
>>>>>>> application life cycle.
>>>>>>> 
>>>>>>> Men seem to want to date beautiful women. It is a very high
>>> priority
>>>> at
>>>>>>> the pub or whatever. But long-married men do not even think about
>>>> their
>>>>>>> wife's attractiveness because that is a small, superficial part of
>>> a
>>>>> much
>>>>>>> larger story.
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> Besides, looks like it's possible to do the same 3 steps with
>>> design
>>>>> doc
>>>>>>>> views created in Erlang? Or is it just about using require() in
>>>>> Node.js?
>>>>>>>> 
>>>>>>> 
>>>>>>> Actually, yes that is a fine point. I myself prefer Node.js but
>>>> anyone
>>>>>> can
>>>>>>> choose the best fit for them.
>>>>>>> 
>>>>>>> And speaking more broadly, CouchDB is a very flexible platform so
>>> it
>>>> is
>>>>>>> quite likely that my own policies do not apply to every use case.
>>> In
>>>>> fact
>>>>>>> if I'm honest I use native views myself, usually for unplanned
>>>>>>> troubleshooting, I want to find conflicts so I use manage_couchdb:
>>>>>>> http://github.com/iriscouch/manage_couchdb
>>>>>>> 
>>>>>>> My main point is, any time somebody says "performance" ask
>>>> yourself
>>>>>> if
>>>>>>> it is really a "performance siren." Earlier in this thread, Jens
>>>> raises
>>>>>>> some examples of plausible true performance requirements, not just
>>>>> siren
>>>>>>> songs.
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 


Re: Erlang vs JavaScript

Posted by Robert Keizer <ro...@keizer.ca>.
On 13-08-18 09:33 AM, Alexander Shorin wrote:
> On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische <vo...@gmail.com> wrote:
>> On 08/18/2013 08:42 AM, Alexander Shorin wrote:
>>> On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>>> On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
>>>>
>>>>> On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
>>>>>> On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
>>>>>> wrote:
>>>>>>> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
>>>>>>>> On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bchesneau@gmail.com
>>>>>>>> wrote:
>>>>>>>>> I agree, (modulo the fact that I would replace a string by a binary
>>>>> ;)
>>>>>>>>> but
>>>>>>>>> that would be only possible if we extract the metadata (_id, _rev)
>>>>> from
>>>>>>>>> the
>>>>>>>>> JSON so couchdb wouldn't have to decode the JSON to get them.
>>>>> Streaming
>>>>>>>>> json would also allows that but since there is no guaranty in the
>>>>>>>>> properties order of a JSON it would be less efficient.
>>>>>>>> What if we split document metadata from document itself?
>>>>>>
>>>>>> I would like to hear a goal for this effort? What is the definition of
>>>>>> success and failure?
>>>>> Idea: move document metadata into separate object.
>>>>>
>>>> How do you link the metadata to the separate object there? Do you let the
>>>> application set the internal links?
>>>>
>>>> I'm +1 with such idea anyway.
>>> Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):
>>>
>>> Btree:
>>>
>>>      ----+----
>>>     |        |
>>>   --+--    --+--
>>> |    |  |    |
>>> *    *  *    *
>>>
>>> At the node we have doc object {...} for specific revision. Instead of
>>> this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
>>> data.
>>> So I think there wouldn't be needed internal links since meta and data
>>> would live within same Btree node.
>>> For regular doc requesting, they will be merged (still need for `_`
>>> prefix to avoid collisions?) and returned as single {...} as always.
>> We could also return them as separate objects, so the view function
>> becomes: function(doc, meta) {}.
>>
>> Couchbase does that and from my experience it works well and feel right.
> Oh, so this idea even works (:
>
> However, the trick was to not pass the doc part (when it is big
> enough) to the view server until the view server has processed its
> metadata. Otherwise this is a nice feature, but it wouldn't help
> speed up indexing. To recap the trick: first process the meta part,
> and only if it passes - load the doc. In a later mail I eventually
> reinvented chained views, because the meta trick does exactly the
> same thing and chained views are the more correct way to go. See the
> quote at the end.
>
> Anyway, I feel we should adopt Couchbase's experience with a document
> metadata object (assuming they wouldn't sue us for that ((: ), since
> everyone already has some preferred metadata fields (like type) or
> uses a special object for them so as not to pollute the main document
> body. I prefer a special '.meta' object at the document root which
> holds document type info, authorship, timestamps, bindings, etc.
> It's a good feature to have whether or not it optimizes the indexing
> process (:

I would suggest either prefixing with an underscore, or the use of a 
separate object passed to the view server.

If someone (such as myself) has many, many documents which happen to
contain a "meta" attribute, it would be non-trivial to upgrade /
migrate. A migration script could be written, of course, although it
wouldn't be ideal.

Something to consider: it may be worthwhile to simply use obj._meta
instead of .meta.
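For illustration, the two shapes being compared (all field names below are hypothetical):

// (a) a reserved key inside the document body:
{"_id": "post1", "_rev": "1-abc", "_meta": {"type": "post", "author": "bob"}, "title": "..."}

// (b) metadata handed to the view server as a separate argument:
function(doc, meta) {
  if (meta.type === "post") {
    emit(meta.id, doc.title);
  }
}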

>
> Below is about chained views:
>
> On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
>> Resume: probably, I'd just described chained views feature with
>> autoindexing by certain fields (:
>> Removing autoindexing feature and we could make views building process
>> much more faster if we make right views chain which will use set
>> algebra operations to calculate target doc ids to pass to final view:
>> reduce docs before map results:
>>
>> {
>> "views": {
>>      "posts": {"map": "...", "reduce": "..."},
>>      "chain": [
>>       ["by_type", {"key": "post"}],
>>       ["hidden", {"key": false}],
>>       ["by_domain", {"keys": ["public", "wiki"]}]
>>    ]
>>   }
>> }
>>
>> In case of 10000 docs db with 1200 posts where 200 are hidden and 400
>> are private, result view posts have to process only 600 docs instead
>> of 10000 and it's index lookup operation to find out the result docs
>> to pass. Sure, calling such view triggers all views in the chain.
>

Chained views would be awesome! I'm sure I'm not alone in having solved 
this problem by using multiple queries and matching document IDs.

Re: Erlang vs JavaScript

Posted by Alexander Shorin <kx...@gmail.com>.
On Sun, Aug 18, 2013 at 3:54 PM, Volker Mische <vo...@gmail.com> wrote:
> On 08/18/2013 08:42 AM, Alexander Shorin wrote:
>> On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>> On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
>>>
>>>> On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
>>>>> On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
>>>>>>> On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bchesneau@gmail.com
>>>>>
>>>>>>> wrote:
>>>>>>>> I agree, (modulo the fact that I would replace a string by a binary
>>>> ;)
>>>>>>>> but
>>>>>>>> that would be only possible if we extract the metadata (_id, _rev)
>>>> from
>>>>>>>> the
>>>>>>>> JSON so couchdb wouldn't have to decode the JSON to get them.
>>>> Streaming
>>>>>>>> json would also allows that but since there is no guaranty in the
>>>>>>>> properties order of a JSON it would be less efficient.
>>>>>>>
>>>>>>> What if we split document metadata from document itself?
>>>>>
>>>>>
>>>>> I would like to hear a goal for this effort? What is the definition of
>>>>> success and failure?
>>>>
>>>> Idea: move document metadata into separate object.
>>>>
>>>
>>> How do you link the metadata to the separate object there? Do you let the
>>> application set the internal links?
>>>
>>> I'm +1 with such idea anyway.
>>
>> Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):
>>
>> Btree:
>>
>>     ----+----
>>    |        |
>>  --+--    --+--
>> |    |  |    |
>> *    *  *    *
>>
>> At the node we have doc object {...} for specific revision. Instead of
>> this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
>> data.
>> So I think there wouldn't be needed internal links since meta and data
>> would live within same Btree node.
>> For regular doc requesting, they will be merged (still need for `_`
>> prefix to avoid collisions?) and returned as single {...} as always.
>
> We could also return them as separate objects, so the view function
> becomes: function(doc, meta) {}.
>
> Couchbase does that and from my experience it works well and feel right.

Oh, so this idea even works (:

However, the trick was to not pass the doc part (when it is big
enough) to the view server until the view server has processed its
metadata. Otherwise this is a nice feature, but it wouldn't help
speed up indexing. To recap the trick: first process the meta part,
and only if it passes - load the doc. In a later mail I eventually
reinvented chained views, because the meta trick does exactly the
same thing and chained views are the more correct way to go. See the
quote at the end.

Anyway, I feel we should adopt Couchbase's experience with a document
metadata object (assuming they wouldn't sue us for that ((: ), since
everyone already has some preferred metadata fields (like type) or
uses a special object for them so as not to pollute the main document
body. I prefer a special '.meta' object at the document root which
holds document type info, authorship, timestamps, bindings, etc.
It's a good feature to have whether or not it optimizes the indexing
process (:

Below is about chained views:

On Fri, Aug 16, 2013 at 11:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
> Resume: probably, I'd just described chained views feature with
> autoindexing by certain fields (:
> Removing autoindexing feature and we could make views building process
> much more faster if we make right views chain which will use set
> algebra operations to calculate target doc ids to pass to final view:
> reduce docs before map results:
>
> {
> "views": {
>     "posts": {"map": "...", "reduce": "..."},
>     "chain": [
>      ["by_type", {"key": "post"}],
>      ["hidden", {"key": false}],
>      ["by_domain", {"keys": ["public", "wiki"]}]
>   ]
>  }
> }
>
> In case of 10000 docs db with 1200 posts where 200 are hidden and 400
> are private, result view posts have to process only 600 docs instead
> of 10000 and it's index lookup operation to find out the result docs
> to pass. Sure, calling such view triggers all views in the chain.

--
,,,^..^,,,

Re: Erlang vs JavaScript

Posted by Volker Mische <vo...@gmail.com>.
On 08/18/2013 08:42 AM, Alexander Shorin wrote:
> On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
>>
>>> On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
>>>> On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
>>>>>> On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bchesneau@gmail.com
>>>>
>>>>>> wrote:
>>>>>>> I agree, (modulo the fact that I would replace a string by a binary
>>> ;)
>>>>>>> but
>>>>>>> that would be only possible if we extract the metadata (_id, _rev)
>>> from
>>>>>>> the
>>>>>>> JSON so couchdb wouldn't have to decode the JSON to get them.
>>> Streaming
>>>>>>> json would also allows that but since there is no guaranty in the
>>>>>>> properties order of a JSON it would be less efficient.
>>>>>>
>>>>>> What if we split document metadata from document itself?
>>>>
>>>>
>>>> I would like to hear a goal for this effort? What is the definition of
>>>> success and failure?
>>>
>>> Idea: move document metadata into separate object.
>>>
>>
>> How do you link the metadata to the separate object there? Do you let the
>> application set the internal links?
>>
>> I'm +1 with such idea anyway.
> 
> Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):
> 
> Btree:
> 
>     ----+----
>    |        |
>  --+--    --+--
> |    |  |    |
> *    *  *    *
> 
> At the node we have doc object {...} for specific revision. Instead of
> this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
> data.
> So I think there wouldn't be needed internal links since meta and data
> would live within same Btree node.
> For regular doc requesting, they will be merged (still need for `_`
> prefix to avoid collisions?) and returned as single {...} as always.

We could also return them as separate objects, so the view function
becomes: function(doc, meta) {}.

Couchbase does that and from my experience it works well and feel right.

Cheers,
  Volker


Re: Erlang vs JavaScript

Posted by JFC Morfin <je...@jefsey.com>.
At 11:49 16/08/2013, Volker Mische wrote:
> > What if we split document metadata from document itself? E.g. pass
> > _id, _rev and other system or meta fields with separate object. Their
> > size much lesser than whole document, so it will be possible to fast
> > decode this metadata and decide is doc need to be processed or not
> > without need to decode/encode megabytes of document's json. Sure, this
> > adds additional communication roundtrip, but in case if it will be
> > faster than json decode/encode - why not?
>
>That would be the ultimate-ultimate goal.

This is a basic requirement for me: incrementally (i.e. metadata on 
metadata) and for syllodata (data between data) interlinks.
jfc 


Re: Erlang vs JavaScript

Posted by "nicholas a. evans" <ni...@ekenosen.net>.
It seems like there might be several simple "internalizing" speedups,
even before tackling the view server protocol or the couchjs view
server, hinted at by Alexander's suggestion:

On Fri, Aug 16, 2013 at 3:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
> Idea: move document metadata into separate object.
...
> Case 2: Large docs. Profit in case when you have set right fields into
> metadata (like doc type, authorship, tags etc.) and filter first by
> this metadata - you have minimal memory footprint, you have less CPU
> load, rule "fast accept - fast reject" works perfectly.

For the simple case of filtering which fields are passed to the map
fn, you don't need full blown chained views, you only need a simple
way to define field filters (describing which fields are the relevant
"metadata" fields).

> Side effect: it's possible to autoindex metadata on fly on document
> update without asking user to write (meta/by_type, meta/by_author,
> meta/by_update_time etc. viiews) . Sure, as much metadata you have as
> large base index will be. In 80% cases it will be no more than 4KB.

Similarly to how the internals of couch already optimize away the case
where multiple views in the same design doc share the same map
function (but different reductions), we should also be able to
optimize away the case where multiple views share the same fields
filter.

> Resume: probably, I'd just described chained views feature with
> autoindexing by certain fields (:

One lesson I learned when I looked into implementing chained
map/reduce views is that they will need to be in different design_docs
from the parent views, in order to play nicely with BigCouch.  Keeping
them in the same design_doc just doesn't work with parallel view
builds (at least, not without breaking normal design_doc
considerations).  So although I really like the simplicity of the
"keep chained views in one design doc" approach, it's probably a
dead-end.

> Removing autoindexing feature and we could make views building process
> much more faster if we make right views chain which will use set
> algebra operations to calculate target doc ids to pass to final view:
> reduce docs before map results:
>
> {
> "views": {
>     "posts": {"map": "...", "reduce": "..."},
>     "chain": [
>      ["by_type", {"key": "post"}],
>      ["hidden", {"key": false}],
>      ["by_domain", {"keys": ["public", "wiki"]}]
>   ]
>  }
> }

I was inspired by your view syntax and thought I'd put forward my own
similar proposal:

{
  "_id": "plain_old_views_for_comparison",
  "views": {
    "single_emit": {
      "map": "function(doc) { if (!doc.foo) { emit([doc.bar, doc.baz], doc.quux); } }",
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": "function(doc) { if (!doc.foo) { emit([0, doc.bar], doc.quux); emit(['baz', doc.baz], doc.quux); } }",
      "reduce": "_count"
    }
  }
}

{
  "_id": "internalized",
  "options": {
    "filter": "!foo",
    "fields": ["bar", "baz", "quux"]
  },
  "views": {
    "single_emit_1": {
      "map": "function(doc) { emit([doc.bar, doc.baz], doc.quux); }",
      "reduce": "_count"
    },
    "single_emit_2": {
      "map": { "key": ["bar", "baz"], "value": "quux" },
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": { "emits": [[[0, "bar"], "quux"], [["'baz'", "baz"], "quux"]] },
      "reduce": "_count"
    }
  }
}

Where the above views should behave the same way.  The view options
would support "filter" as a guard clause and "fields" to strip out all
but the relevant metadata.  These should be defined at the design
document level to simplify working with the current view server
protocol.  And the view "map" could optionally be an object describing
the emit values instead of a function string.

The filter string should be simple but powerful: I'd suggest
supporting !, &&, ||, (), "foo.bar.baz", >, <, >=, <=, ==, !=,
numbers, and strings (for "type == 'foo'"). But even if all it
supported was "foo" and "!foo", it would still be useful.  In some
cases, this will prevent most docs from ever needing to be evaluated
by the view server.  The "fields" array might also consider filtering
nested fields like with "foo.bar.baz".  The "filter" and internal map
("key", "value", "emits") should support the same values that "fields"
supports plus numbers and strings, or they could support the same
syntax as "filter" to do things like "key": ["!!deleted_at",
"deleted_at"].  The "filter" and internal map would be able to use all
of the fields, not just the ones defined in the options.
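In other words, an options block like the one below would behave as if every map function were wrapped in the equivalent guard (the filter syntax is the proposal above, not an existing feature):

"options": { "filter": "!foo && type == 'post'" }

// roughly equivalent to prepending this to each map function:
function(doc) {
  if (!doc.foo && doc.type === 'post') {
    // ... original map body, reached only by docs that pass the filter ...
  }
}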

Another odd case where I've personally noticed indexing speed get
immensely bogged down is when the reduce function merges the map
objects together.  I've seen views with this problem go up to 5GB
during initial load and compact back down to 20MB.  I've documented
this problem and my workaround here:
https://gist.github.com/nevans/5512593.  The hideous reduce pattern
in that gist has resulted in 2-5x faster view builds for me (small DBs
infinitesimally slower, huge speedup for big DBs).  But it would be
*much* better to simply add a "minimum_reduced_group_level" option to
the view, and let the Erlang handle that without doing unnecessary
view server round trips and hideously complicated reduce functions.
Any group_level below the minimum_reduced_group_level would simply
return "null" for all of the values.

This isn't a trivial proposal, but it can be implemented completely
independently of any view server protocol or couchjs changes.  And
even a simplified version could still yield major speedup for some of
the most common map patterns, just as "_sum" and "_count" speed up the
most common reduce functions.  Also, the individual pieces can be
implemented independently:  If I were to work on this myself (probably
not going to happen in the next month or two), I'd do
"minimum_reduced_group_level" first and "filter" second, since I think
that's where *my* biggest bang for the buck would be.  But other
people's dataset (e.g. large docs) might get the biggest improvement
from "fields".  And if you have lots of simple map functions, you
might get the biggest speedup from the internal map "key", "value",
"emits".

What do you think?  Ugly and untenable?  Or a shot in the right direction?


Also, I know that Jason already yielded on the O(N) argument, but I
got here late and wanted to add my $0.02:  Obviously anything better
than O(N) is impossible when you need to map N documents.  Changing to
O(N/Q) (where Q=parallelism of view indexing; e.g. throw hardware at
it) is still essentially O(N), but it's very useful and something that
BigCouch does nicely.  A 9x speedup might be the difference between a
rollout taking 90 hours (barely finishes over the weekend) and 10
hours (you can do it overnight during the week).  The longer the view
rollout period, the slower and more cautious the
development/deployment cycle becomes.  More importantly, it might be
the difference between loading a large user in 9 hours vs 60 minutes,
which will feel like a qualitative improvement to that user and is
especially important when that user is e.g. Walt Mossberg and load
time is one of two nitpicks he has in his review.  Or when you have a
hundred similarly jumbo-sized users sign up the next day.  Sorry for
piling on after the argument is over.  :)

-- 
Nick

Re: Erlang vs JavaScript

Posted by Alexander Shorin <kx...@gmail.com>.
On Sun, Aug 18, 2013 at 10:22 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin <kx...@gmail.com> wrote:
>
>> On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
>> > On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
>> > wrote:
>> >>
>> >> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
>> >> > On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bchesneau@gmail.com
>> >
>> >> > wrote:
>> >> >> I agree, (modulo the fact that I would replace a string by a binary
>> ;)
>> >> >> but
>> >> >> that would be only possible if we extract the metadata (_id, _rev)
>> from
>> >> >> the
>> >> >> JSON so couchdb wouldn't have to decode the JSON to get them.
>> Streaming
>> >> >> json would also allows that but since there is no guaranty in the
>> >> >> properties order of a JSON it would be less efficient.
>> >> >
>> >> > What if we split document metadata from document itself?
>> >
>> >
>> > I would like to hear a goal for this effort? What is the definition of
>> > success and failure?
>>
>> Idea: move document metadata into separate object.
>>
>
> How do you link the metadata to the separate object there? Do you let the
> application set the internal links?
>
> I'm +1 with such idea anyway.

Mmm...how I imagine it (Disclaimer: I'm sure I'm wrong in details there!):

Btree:

    ----+----
   |        |
 --+--    --+--
|    |  |    |
*    *  *    *

At the node we have doc object {...} for specific revision. Instead of
this, we'll have a tuple ({...}, {...}) - first is a meta, second is a
data.
So I think there wouldn't be needed internal links since meta and data
would live within same Btree node.
For regular doc requesting, they will be merged (still need for `_`
prefix to avoid collisions?) and returned as single {...} as always.

--
,,,^..^,,,

Re: Erlang vs JavaScript

Posted by Benoit Chesneau <bc...@gmail.com>.
On Fri, Aug 16, 2013 at 9:58 PM, Alexander Shorin <kx...@gmail.com> wrote:

> On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
> > On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
> > wrote:
> >>
> >> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
> >> > On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bchesneau@gmail.com
> >
> >> > wrote:
> >> >> I agree, (modulo the fact that I would replace a string by a binary
> ;)
> >> >> but
> >> >> that would be only possible if we extract the metadata (_id, _rev)
> from
> >> >> the
> >> >> JSON so couchdb wouldn't have to decode the JSON to get them.
> Streaming
> >> >> json would also allows that but since there is no guaranty in the
> >> >> properties order of a JSON it would be less efficient.
> >> >
> >> > What if we split document metadata from document itself?
> >
> >
> > I would like to hear a goal for this effort? What is the definition of
> > success and failure?
>
> Idea: move document metadata into separate object.
>

How do you link the metadata to the separate object there? Do you let the
application set the internal links?

I'm +1 with such idea anyway.



> Motivation:
>
> Case 1: Small docs. No profit at all. More over, probably it's better
> to not split things there e.g. pass full doc if his size around some
> amount of megabytes.
> Case 2: Large docs. Profit in case when you have set right fields into
> metadata (like doc type, authorship, tags etc.) and filter first by
> this metadata - you have minimal memory footprint, you have less CPU
> load, rule "fast accept - fast reject" works perfectly.
>
> Side effect: it's possible to first filter by metadata and leave only
> required to process document ids. And if we known what and how many to
> process, we may make assumptions about parallel indexation.
>
> Side effect: it's possible to autoindex metadata on fly on document
> update without asking user to write (meta/by_type, meta/by_author,
> meta/by_update_time etc. viiews) . Sure, as much metadata you have as
> large base index will be. In 80% cases it will be no more than 4KB.
>
> Resume: probably, I'd just described chained views feature with
> autoindexing by certain fields (:
> Removing autoindexing feature and we could make views building process
> much more faster if we make right views chain which will use set
> algebra operations to calculate target doc ids to pass to final view:
> reduce docs before map results:
>
> {
> "views": {
>     "posts": {"map": "...", "reduce": "..."},
>     "chain": [
>      ["by_type", {"key": "post"}],
>      ["hidden", {"key": false}],
>      ["by_domain", {"keys": ["public", "wiki"]}]
>   ]
>  }
> }
>
> In case of 10000 docs db with 1200 posts where 200 are hidden and 400
> are private, result view posts have to process only 600 docs instead
> of 10000 and it's index lookup operation to find out the result docs
> to pass. Sure, calling such view triggers all views in the chain. And
> I don't think about cross dependencies and loops for know.
>
> --
> ,,,^..^,,,
>

Re: Erlang vs JavaScript

Posted by Alexander Shorin <kx...@gmail.com>.
On Fri, Aug 16, 2013 at 11:23 PM, Jason Smith <jh...@apache.org> wrote:
> On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>
> wrote:
>>
>> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
>> > On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bc...@gmail.com>
>> > wrote:
>> >> I agree, (modulo the fact that I would replace a string by a binary ;)
>> >> but
>> >> that would be only possible if we extract the metadata (_id, _rev) from
>> >> the
>> >> JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
>> >> json would also allows that but since there is no guaranty in the
>> >> properties order of a JSON it would be less efficient.
>> >
>> > What if we split document metadata from document itself?
>
>
> I would like to hear a goal for this effort? What is the definition of
> success and failure?

Idea: move document metadata into a separate object.

Motivation:

Case 1: Small docs. No benefit at all. Moreover, it's probably better
not to split things here, e.g. pass the full doc if its size is under
some number of megabytes.
Case 2: Large docs. There is a benefit when you have put the right
fields into the metadata (like doc type, authorship, tags, etc.) and
filter by this metadata first - you get a minimal memory footprint and
less CPU load, and the "fast accept - fast reject" rule works perfectly.

Side effect: it's possible to filter by metadata first and keep only
the document ids that need processing. And if we know what and how
many docs there are to process, we can make assumptions about parallel
indexing.

Side effect: it's possible to autoindex metadata on the fly on document
update without asking the user to write views (meta/by_type,
meta/by_author, meta/by_update_time, etc.). Sure, the more metadata you
have, the larger the base index will be. In 80% of cases it will be no
more than 4KB.

Summary: probably I've just described a chained views feature with
autoindexing by certain fields (:
Dropping the autoindexing part, we could make the view building process
much faster if we build the right view chain, which uses set algebra
operations to calculate the target doc ids to pass to the final view -
reduce the docs before mapping them:

{
"views": {
    "posts": {"map": "...", "reduce": "..."},
    "chain": [
     ["by_type", {"key": "post"}],
     ["hidden", {"key": false}],
     ["by_domain", {"keys": ["public", "wiki"]}]
  ]
 }
}

For a 10000-doc database with 1200 posts, of which 200 are hidden and
400 are private, the resulting "posts" view only has to process 600
docs instead of 10000, and finding the docs to pass is an index lookup
operation. Sure, calling such a view triggers all views in the chain.
And I'm not thinking about cross dependencies and loops for now.
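For illustration, the chain above presumes companion views roughly like these; only the view names and keys come from the example, the map bodies are my guesses:

"by_type":   { "map": "function(doc) { emit(doc.type, null); }" },
"hidden":    { "map": "function(doc) { emit(doc.hidden === true, null); }" },
"by_domain": { "map": "function(doc) { emit(doc.domain, null); }" }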

--
,,,^..^,,,

Re: Erlang vs JavaScript

Posted by Jason Smith <jh...@apache.org>.
On Fri, Aug 16, 2013 at 4:49 PM, Volker Mische <vo...@gmail.com>wrote:

> On 08/16/2013 11:32 AM, Alexander Shorin wrote:
> > On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bc...@gmail.com>
> wrote:
> >> I agree, (modulo the fact that I would replace a string by a binary ;)
> but
> >> that would be only possible if we extract the metadata (_id, _rev) from
> the
> >> JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
> >> json would also allows that but since there is no guaranty in the
> >> properties order of a JSON it would be less efficient.
> >
> > What if we split document metadata from document itself?
>

I would like to hear a goal for this effort? What is the definition of
success and failure?

Jan makes a fine point on user@. I "live with the pain." But really, life
is pain. Deny it if you must. Until we are delivered--finally!--our sweet
release, we will necessarily endure pain.

Facts:

* When you store a record, a machine must write that to storage
* If you have an index, a machine must update the index to storage

Building an index requires visiting every document. One way or another, the
entire .couch file is coming off the disk and going through the ringer. One
way or another, every row in the view will be written. I am not clear why
optimizing from N ms/doc to 1/2 N ms/doc will help, when you still have to
read 30GB from storage, and write 30GB back.

On one end, the computer scientist says we cannot avoid the necessary time
complexity. On the other end, the casual user says, if it is not
instantaneous, then it hardly matters.

That is, we have a problem of expectation management, not codec speed.
Nobody expects MySQL's CREATE INDEX to finish in a flash, and nobody should
expect that of a view.

If somebody does set out to accelerate views, you're welcome. But I would
ask: what is a successful optimization, and why?

(Also, Noah, if you are out there, this is an example of the sort of thing
I would put on the wiki but past bad experiences make me say "can't be
bothered.")

Re: Erlang vs JavaScript

Posted by Volker Mische <vo...@gmail.com>.
On 08/16/2013 11:32 AM, Alexander Shorin wrote:
> On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bc...@gmail.com> wrote:
>> I agree, (modulo the fact that I would replace a string by a binary ;) but
>> that would be only possible if we extract the metadata (_id, _rev) from the
>> JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
>> json would also allows that but since there is no guaranty in the
>> properties order of a JSON it would be less efficient.
> 
> What if we split document metadata from document itself? E.g. pass
> _id, _rev and other system or meta fields with separate object. Their
> size much lesser than whole document, so it will be possible to fast
> decode this metadata and decide is doc need to be processed or not
> without need to decode/encode megabytes of document's json. Sure, this
> adds additional communication roundtrip, but in case if it will be
> faster than json decode/encode - why not?

That would be the ultimate-ultimate goal.

Cheers,
  Volker



Re: Erlang vs JavaScript

Posted by Alexander Shorin <kx...@gmail.com>.
On Fri, Aug 16, 2013 at 1:12 PM, Benoit Chesneau <bc...@gmail.com> wrote:
> I agree, (modulo the fact that I would replace a string by a binary ;) but
> that would be only possible if we extract the metadata (_id, _rev) from the
> JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
> json would also allows that but since there is no guaranty in the
> properties order of a JSON it would be less efficient.

What if we split the document metadata from the document itself? E.g.
pass _id, _rev and other system or meta fields as a separate object.
Their size is much smaller than the whole document, so it would be
possible to decode this metadata quickly and decide whether the doc
needs to be processed at all, without having to decode/encode megabytes
of the document's JSON. Sure, this adds an additional communication
roundtrip, but if it turns out faster than the JSON decode/encode - why not?
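A hypothetical exchange to illustrate the fast-accept/fast-reject idea (the message names are invented; today's protocol has no such split):

["map_doc_meta", {"_id": "post1", "_rev": "3-def", "type": "comment"}]
   <- false                  // reject: the body is never shipped
["map_doc_meta", {"_id": "post2", "_rev": "7-abc", "type": "post"}]
   <- true                   // accept: now send the full document
["map_doc", {"title": "...", "body": "... megabytes of JSON ..."}]
   <- [[["post2", null]]]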

--
,,,^..^,,,

Re: Erlang vs JavaScript

Posted by Benoit Chesneau <bc...@gmail.com>.
On Fri, Aug 16, 2013 at 11:05 AM, Volker Mische <vo...@gmail.com>wrote:

> On 08/15/2013 11:53 AM, Benoit Chesneau wrote:
> > On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt <ja...@apache.org> wrote:
> >
> >>
> >> On Aug 15, 2013, at 10:09 , Robert Newson <rn...@apache.org> wrote:
> >>
> >>> A big +1 to Jason's clarification of "erlang" vs "native". CouchDB
> >>> could have shipped an erlang view server that worked in a separate
> >>> process and had the stdio overhead, to combine the slowness of the
> >>> protocol with the obtuseness of erlang. ;)
> >>>
> >>> Evaluating Javascript within the erlang VM process intrigues me, Jens,
> >>> how is that done in your case? I've not previously found the assertion
> >>> that V8 would be faster than SpiderMonkey for a view server compelling
> >>> since the bottleneck is almost never in the code evaluation, but I do
> >>> support CouchDB switching to it for the synergy effects of a closer
> >>> binding with node.js, but if it's running in the same process, that
> >>> would change (though I don't immediately see why the same couldn't be
> >>> done for SpiderMonkey). Off the top of my head, I don't know a safe
> >>> way to evaluate JS in the VM. A NIF-based approach would either be
> >>> quite elaborate or would trip all the scheduling problems that
> >>> long-running NIF's are now notorious for.
> >>>
> >>> At a step removed, the view server protocol itself seems like the
> >>> thing to improve on, it feels like that's the principal bottleneck.
> >>
> >> The code is here:
> >> https://github.com/couchbase/couchdb/tree/master/src/mapreduce
> >>
> >> I’d love for someone to pick this up and give CouchDB, say, a
> ./configure
> >> --enable-native-v8 option or a plugin that allows people to opt into the
> >> speed improvements made there. :)
> >>
> >> The choice for V8 was made because of easier integration API and more
> >> reliable releases as a standalone project, which I think was a smart
> move.
> >>
> >> IIRC it relies on a change to CouchDB-y internals that has not made it
> >> back from Couchbase to CouchDB (Filipe will know, but I doubt he’s
> reading
> >> this thread), but we should look into that and get us “native JS
> views”, at
> >> least as an option or plugin.
> >>
> >> CCing dev@.
> >>
> >> Jan
> >> --
> >>
> >>
> > Well on the first hand nifs look like a good idea but can be very
> > problematic:
> >
> > - when the view computation take time it would block the full vm
> > scheduling. It can be mitigated using a pool of threads to execute the
> work
> > asynchronously but then can create other problems like memory leaking
> etc.
> > - nifs can't be upgraded easily during hot upgrade
> > - when a nif crash, all the vm crash.
> >
> > (Note that we have the same problem when using a nif to decode/encode
> json,
> > it only works well with medium sized documents)
> >
> > One other way to improve the js handling would be removing the main
> > bottleneck ie the serialization-deserialization we do on each step. Not
> > sure if it exists but  feasible, why not passing erlang terms from erlang
> > to js and js to erlang? So at the end the deserialization would happen
> only
> > on the JS side ie instead of having
> >
> > get erlang term
> > encode to json
> > send to js
> > decode json
> > process
> > encode json
> > send json
> > decode json to erlang term
> > store
> >
> > we sould just have
> >
> > get erlang term
> > send over STDIO
> > decode erlang term to JS object
> > process
> > encode to erlang term
> > send erlang term
> > store
> >
> > Erlang serialization is also very optimised.
>
> I think the ultimate goal should be to be as little
> conversion/serialisation as possible, hence no conversion to Erlang
> Terms at all.
>
> Input as string
> Parsing to get ID
> Store as string
>
> Send to JS as string
> Process with JS
> Store as string
>
> Cheers,
>   Volker
>
>
>

I agree (modulo the fact that I would replace a string by a binary ;), but
that would only be possible if we extract the metadata (_id, _rev) from the
JSON so couchdb wouldn't have to decode the JSON to get them. Streaming
JSON would also allow that, but since there is no guarantee about the
property order of a JSON document it would be less efficient.

- benoit

Re: Erlang vs JavaScript

Posted by Volker Mische <vo...@gmail.com>.
On 08/15/2013 11:53 AM, Benoit Chesneau wrote:
> On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt <ja...@apache.org> wrote:
> 
>>
>> On Aug 15, 2013, at 10:09 , Robert Newson <rn...@apache.org> wrote:
>>
>>> A big +1 to Jason's clarification of "erlang" vs "native". CouchDB
>>> could have shipped an erlang view server that worked in a separate
>>> process and had the stdio overhead, to combine the slowness of the
>>> protocol with the obtuseness of erlang. ;)
>>>
>>> Evaluating Javascript within the erlang VM process intrigues me, Jens,
>>> how is that done in your case? I've not previously found the assertion
>>> that V8 would be faster than SpiderMonkey for a view server compelling
>>> since the bottleneck is almost never in the code evaluation, but I do
>>> support CouchDB switching to it for the synergy effects of a closer
>>> binding with node.js, but if it's running in the same process, that
>>> would change (though I don't immediately see why the same couldn't be
>>> done for SpiderMonkey). Off the top of my head, I don't know a safe
>>> way to evaluate JS in the VM. A NIF-based approach would either be
>>> quite elaborate or would trip all the scheduling problems that
>>> long-running NIF's are now notorious for.
>>>
>>> At a step removed, the view server protocol itself seems like the
>>> thing to improve on, it feels like that's the principal bottleneck.
>>
>> The code is here:
>> https://github.com/couchbase/couchdb/tree/master/src/mapreduce
>>
>> I’d love for someone to pick this up and give CouchDB, say, a ./configure
>> --enable-native-v8 option or a plugin that allows people to opt into the
>> speed improvements made there. :)
>>
>> The choice for V8 was made because of easier integration API and more
>> reliable releases as a standalone project, which I think was a smart move.
>>
>> IIRC it relies on a change to CouchDB-y internals that has not made it
>> back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
>> this thread), but we should look into that and get us “native JS views”, at
>> least as an option or plugin.
>>
>> CCing dev@.
>>
>> Jan
>> --
>>
>>
> Well on the first hand nifs look like a good idea but can be very
> problematic:
> 
> - when the view computation take time it would block the full vm
> scheduling. It can be mitigated using a pool of threads to execute the work
> asynchronously but then can create other problems like memory leaking etc.
> - nifs can't be upgraded easily during hot upgrade
> - when a nif crash, all the vm crash.
> 
> (Note that we have the same problem when using a nif to decode/encode json,
> it only works well with medium sized documents)
> 
> One other way to improve the js handling would be removing the main
> bottleneck ie the serialization-deserialization we do on each step. Not
> sure if it exists but  feasible, why not passing erlang terms from erlang
> to js and js to erlang? So at the end the deserialization would happen only
> on the JS side ie instead of having
> 
> get erlang term
> encode to json
> send to js
> decode json
> process
> encode json
> send json
> decode json to erlang term
> store
> 
> we sould just have
> 
> get erlang term
> send over STDIO
> decode erlang term to JS object
> process
> encode to erlang term
> send erlang term
> store
> 
> Erlang serialization is also very optimised.

I think the ultimate goal should be to do as little
conversion/serialisation as possible, hence no conversion to Erlang
Terms at all.

Input as string
Parsing to get ID
Store as string

Send to JS as string
Process with JS
Store as string

Cheers,
  Volker



Re: Erlang vs JavaScript

Posted by Jan Lehnardt <ja...@apache.org>.
On Aug 15, 2013, at 11:53 , Benoit Chesneau <bc...@gmail.com> wrote:

> On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt <ja...@apache.org> wrote:
> 
>> 
>> On Aug 15, 2013, at 10:09 , Robert Newson <rn...@apache.org> wrote:
>> 
>>> A big +1 to Jason's clarification of "erlang" vs "native". CouchDB
>>> could have shipped an erlang view server that worked in a separate
>>> process and had the stdio overhead, to combine the slowness of the
>>> protocol with the obtuseness of erlang. ;)
>>> 
>>> Evaluating Javascript within the erlang VM process intrigues me, Jens,
>>> how is that done in your case? I've not previously found the assertion
>>> that V8 would be faster than SpiderMonkey for a view server compelling
>>> since the bottleneck is almost never in the code evaluation, but I do
>>> support CouchDB switching to it for the synergy effects of a closer
>>> binding with node.js, but if it's running in the same process, that
>>> would change (though I don't immediately see why the same couldn't be
>>> done for SpiderMonkey). Off the top of my head, I don't know a safe
>>> way to evaluate JS in the VM. A NIF-based approach would either be
>>> quite elaborate or would trip all the scheduling problems that
>>> long-running NIF's are now notorious for.
>>> 
>>> At a step removed, the view server protocol itself seems like the
>>> thing to improve on, it feels like that's the principal bottleneck.
>> 
>> The code is here:
>> https://github.com/couchbase/couchdb/tree/master/src/mapreduce
>> 
>> I’d love for someone to pick this up and give CouchDB, say, a ./configure
>> --enable-native-v8 option or a plugin that allows people to opt into the
>> speed improvements made there. :)
>> 
>> The choice for V8 was made because of easier integration API and more
>> reliable releases as a standalone project, which I think was a smart move.
>> 
>> IIRC it relies on a change to CouchDB-y internals that has not made it
>> back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
>> this thread), but we should look into that and get us “native JS views”, at
>> least as an option or plugin.
>> 
>> CCing dev@.
>> 
>> Jan
>> --
>> 
>> 
> At first glance, NIFs look like a good idea, but they can be very
> problematic:
> 
> - when a view computation takes a long time, it blocks the whole VM's
> scheduling. This can be mitigated by using a pool of threads to execute the
> work asynchronously, but that can create other problems, such as memory leaks.
> - NIFs can't be upgraded easily during a hot upgrade
> - when a NIF crashes, the whole VM crashes.

Yeah, totally; hence making the whole thing optional.

> (Note that we have the same problem when using a NIF to decode/encode JSON;
> it only works well with medium-sized documents.)




> Another way to improve the JS handling would be to remove the main
> bottleneck, i.e. the serialization/deserialization we do at each step. I'm
> not sure whether an implementation exists, but it seems feasible: why not
> pass Erlang terms between Erlang and JS? Then the deserialization would
> happen only on the JS side, i.e. instead of
> 
> get erlang term
> encode to json
> send to js
> decode json
> process
> encode json
> send json
> decode json to erlang term
> store
> 
> we would just have
> 
> get erlang term
> send over STDIO
> decode erlang term to JS object
> process
> encode to erlang term
> send erlang term
> store
> 
> Erlang serialization is also very optimised.
> 
> 
> Both solutions could co-exist; it may be worth trying and benchmarking each...

I think we just want both solutions, period: the embedded one will still be faster but potentially a little less stable, while the external view server will be slower but extremely robust. Users should be able to choose between them :)

Best
Jan
--
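
A rough Erlang sketch of what "choosing between them" could look like, assuming both backends sit behind a common interface. Nothing here is CouchDB's actual code; the module names, the application key, and the behaviour contract are all hypothetical.

    %% Illustrative only: both view-server implementations behind one
    %% behaviour, selected by a configuration flag at call time.
    -module(view_server).
    -export([map_doc/2]).

    %% Contract both backends would implement.
    -callback map_doc(ServerRef :: term(), DocBin :: binary()) -> {ok, [term()]}.

    %% embedded -> in-process V8 via a NIF (faster, less isolated)
    %% external -> classic couchjs over stdio (slower, very robust)
    backend() ->
        case application:get_env(couch, view_server_backend, external) of
            embedded -> couch_v8_nif_server;
            _        -> couch_os_view_server
        end.

    map_doc(ServerRef, DocBin) ->
        Backend = backend(),
        Backend:map_doc(ServerRef, DocBin).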



Re: Erlang vs JavaScript

Posted by Benoit Chesneau <bc...@gmail.com>.
On Thu, Aug 15, 2013 at 11:38 AM, Jan Lehnardt <ja...@apache.org> wrote:

>
> On Aug 15, 2013, at 10:09 , Robert Newson <rn...@apache.org> wrote:
>
> > A big +1 to Jason's clarification of "erlang" vs "native". CouchDB
> > could have shipped an erlang view server that worked in a separate
> > process and had the stdio overhead, to combine the slowness of the
> > protocol with the obtuseness of erlang. ;)
> >
> > Evaluating Javascript within the erlang VM process intrigues me, Jens,
> > how is that done in your case? I've not previously found the assertion
> > that V8 would be faster than SpiderMonkey for a view server compelling
> > since the bottleneck is almost never in the code evaluation, but I do
> > support CouchDB switching to it for the synergy effects of a closer
> > binding with node.js, but if it's running in the same process, that
> > would change (though I don't immediately see why the same couldn't be
> > done for SpiderMonkey). Off the top of my head, I don't know a safe
> > way to evaluate JS in the VM. A NIF-based approach would either be
> > quite elaborate or would trip all the scheduling problems that
> > long-running NIF's are now notorious for.
> >
> > At a step removed, the view server protocol itself seems like the
> > thing to improve on, it feels like that's the principal bottleneck.
>
> The code is here:
> https://github.com/couchbase/couchdb/tree/master/src/mapreduce
>
> I’d love for someone to pick this up and give CouchDB, say, a ./configure
> --enable-native-v8 option or a plugin that allows people to opt into the
> speed improvements made there. :)
>
> The choice for V8 was made because of easier integration API and more
> reliable releases as a standalone project, which I think was a smart move.
>
> IIRC it relies on a change to CouchDB-y internals that has not made it
> back from Couchbase to CouchDB (Filipe will know, but I doubt he’s reading
> this thread), but we should look into that and get us “native JS views”, at
> least as an option or plugin.
>
> CCing dev@.
>
> Jan
> --
>
>
At first glance, NIFs look like a good idea, but they can be very
problematic:

- when a view computation takes a long time, it blocks the whole VM's
scheduling. This can be mitigated by using a pool of threads to execute the
work asynchronously, but that can create other problems, such as memory leaks.
- NIFs can't be upgraded easily during a hot upgrade
- when a NIF crashes, the whole VM crashes.

(Note that we have the same problem when using a NIF to decode/encode JSON;
it only works well with medium-sized documents.)
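
For context, the standard NIF loading pattern looks like the generic sketch below (module and .so names are illustrative; this is not the Couchbase mapreduce code). It makes the failure modes above concrete: the shared object is mapped straight into the VM process, and the exported Erlang functions are just stubs that the native library replaces at load time.

    %% Generic NIF skeleton (Erlang side only). Once erlang:load_nif/2
    %% succeeds, map_doc/2 runs native code inside the VM: a crash there
    %% kills the whole node, and a long-running call hogs a scheduler
    %% thread unless the NIF yields or offloads to its own worker threads.
    -module(mapreduce_nif).
    -export([map_doc/2]).
    -on_load(init/0).

    init() ->
        %% Loads priv/mapreduce_nif.so (path is illustrative); 0 is an
        %% opaque argument passed to the native load callback.
        erlang:load_nif("priv/mapreduce_nif", 0).

    %% Stub replaced by the native implementation at load time; if the
    %% library failed to load, callers crash with nif_not_loaded instead.
    map_doc(_JsFunSource, _DocJson) ->
        erlang:nif_error(nif_not_loaded).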

Another way to improve the JS handling would be to remove the main
bottleneck, i.e. the serialization/deserialization we do at each step. I'm
not sure whether an implementation exists, but it seems feasible: why not
pass Erlang terms between Erlang and JS? Then the deserialization would
happen only on the JS side, i.e. instead of

get erlang term
encode to json
send to js
decode json
process
encode json
send json
decode json to erlang term
store

we would just have

get erlang term
send over STDIO
decode erlang term to JS object
process
encode to erlang term
send erlang term
store

Erlang serialization is also very optimised.
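
On the Erlang side, that second pipeline could look roughly like the sketch below: one term_to_binary/1 call and a length-prefixed port, with JSON never entering the picture. The "couchjs-terms" command and an external-term-format decoder on the JS end are assumptions, i.e. exactly the part that may not exist yet.

    %% Hedged sketch of the Erlang half only; the JS process is assumed to
    %% read 4-byte length-prefixed frames and decode the external term format.
    -module(term_port_sketch).
    -export([open/0, map_doc/2]).

    open() ->
        %% {packet, 4} prefixes every message with its length, so the JS
        %% side can frame its reads on stdin.
        open_port({spawn, "couchjs-terms"}, [binary, {packet, 4}]).

    map_doc(Port, Doc) ->
        %% One encode on the Erlang side: external term format, no JSON.
        true = port_command(Port, term_to_binary(Doc)),
        receive
            {Port, {data, Reply}} ->
                %% The JS side is assumed to answer in the same format.
                {ok, binary_to_term(Reply)}
        after 5000 ->
            {error, timeout}
        end.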


Both solutions could co-exist; it may be worth trying and benchmarking each...


- benoit
