Posted to user@couchdb.apache.org by Tito Ciuro <tc...@mac.com> on 2014/10/28 18:32:37 UTC

How does indexing really work?

Hello,

I’m a bit confused about how CouchDB really works. I just launched Futon and see that the indexer is busy working on a design document. I have almost a million documents.

A few minutes later, I see three more tasks appearing, all belonging to different design documents. No problem, except that the total count is all different:

- design doc 1: ~950,000
- design doc 2: ~450,000
- design doc 3: ~313,000
- design doc 4: ~85,000

Why are the total counts different? My understanding is/was that a database holds N documents. Each indexing function is passed a document, which is then checked to see whether it has the doc_type the function expects:

function(doc) {
    if (doc.Type == "customer") {
        emit(doc._id, {LastName: doc.LastName, FirstName: doc.FirstName});
    }
}

In the Genesis case, I was assuming that each view would have to go through each document across the database and index its own doc_type. Basically, one round for each design document for N total documents. For example, if the database contains 100,000 documents and two design documents, there would be two active tasks listed:

- _design/customers => index 100,000 documents
- _design/orders => index 100,000 documents

Later on, the indexing would be partial and the delta (say 9,000 docs) would have to be reindexed by each view:

- _design/customers => index 9,000 documents
- _design/orders => index 9,000 documents

This doesn’t seem to be the case. I’d love to know how indexing really works.

Thanks!

— Tito

Re: How does indexing really work?

Posted by Tito Ciuro <tc...@mac.com>.

Hi Joan,

Apologies for not replying earlier. Your explanation makes sense. I currently have design docs per class type, each containing several views. For example: _design/people, _design/orders, _design/items, etc. For the most part, it’s working well, though I’m trying to understand the data access patterns exercised by the app to verify whether further grouping or disassociation is needed.

Thanks again everybody for the help,

— Tito

> On Oct 28, 2014, at 1:04 PM, Joan Touzet <wo...@apache.org> wrote:
> 
> Ah, I follow now. This explanation should help.
> 
> Each design doc is being processed by an independent couchjs JavaScript interpreter.
> 
> If you want all views to be processed with exactly the same delta, put them all in the same design doc. You are trading off parallelism (# of ddocs, and # of CPU cores potentially) for synchronicity.
> 
> Another point to consider is that, if all views are in the same doc, an edit to any of them will cause recalculation of *all* the views in that doc. This might be good - maybe these views never change! It really depends on your development and deployment model.
> 
> -Joan
> 
> ----- Original Message -----
> From: "Tito Ciuro" <tc...@mac.com>
> To: user@couchdb.apache.org
> Sent: Tuesday, October 28, 2014 3:32:57 PM
> Subject: Re: How does indexing really work?
> 
> Hi Sebastian,
> 
> Even if indexing had started several times and the updates seq number was valid, shouldn’t all tasks share the same number? (that is, shouldn’t all tasks be indexing the same delta?) :-/
> 
> Regards,
> 
> — Tito
> 
>> On Oct 28, 2014, at 12:08 PM, Sebastian Rothbucher <se...@googlemail.com> wrote:
>> 
>> my Futon says sth like "Processed 0 of 4 changes (0%)" - and the # of
>> changes might differ from the # of docs (at least when the indexing is
>> started several times). Likewise, the Futon start page displays # of docs
>> and the updates seq (=# of changes). Could that be one explanation?
> 
>

Re: How does indexing really work?

Posted by Joan Touzet <wo...@apache.org>.

Ah, I follow now. This explanation should help.

Each design doc is being processed by an independent couchjs JavaScript interpreter.

If you want all views to be processed with exactly the same delta, put them all in the same design doc. You are trading off parallelism (# of ddocs, and # of CPU cores potentially) for synchronicity.

Another point to consider is that, if all views are in the same doc, an edit to any of them will cause recalculation of *all* the views in that doc. This might be good - maybe these views never change! It really depends on your development and deployment model.

-Joan
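
To illustrate the grouping described above, here is a minimal sketch of a single design document holding two views, so that both are built by the same indexer task and see the same delta. The design document name (_design/all_types), the view names, and the "order" type are hypothetical and only serve the example; the customer map function is the one from the original question.

```javascript
// Hypothetical design document that groups two views in one place.
// All names here are illustrative, not taken from a real database.
const designDoc = {
  _id: "_design/all_types",
  language: "javascript",
  views: {
    customers: {
      // Map functions are stored as strings inside a design document.
      map: function (doc) {
        if (doc.Type == "customer") {
          emit(doc._id, { LastName: doc.LastName, FirstName: doc.FirstName });
        }
      }.toString(),
    },
    orders: {
      map: function (doc) {
        if (doc.Type == "order") {
          emit(doc._id, null);
        }
      }.toString(),
    },
  },
};

// Print the JSON body that would be stored as the design document.
console.log(JSON.stringify(designDoc, null, 2));
```

The trade-off is the one stated above: editing either view in this document would trigger a rebuild of both.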

Re: How does indexing really work?

Posted by Tito Ciuro <tc...@mac.com>.

Hi Sebastian,

Even if indexing had started several times and the updates seq number was valid, shouldn’t all tasks share the same number? (that is, shouldn’t all tasks be indexing the same delta?) :-/

Regards,

— Tito
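
One way to picture why the deltas need not match is the toy model below. It assumes that each design document's index keeps its own record of the last database update sequence it processed, so the work left for each one is the gap between that record and the database's current sequence. The function name and all numbers are hypothetical, and a plain subtraction is a simplification of how CouchDB counts pending changes.

```javascript
// Toy model: each design document's index remembers the last update
// sequence it processed; the pending work is the gap to the current one.
function pendingChanges(dbUpdateSeq, lastIndexedSeqByDdoc) {
  const result = {};
  for (const [ddoc, lastSeq] of Object.entries(lastIndexedSeqByDdoc)) {
    result[ddoc] = dbUpdateSeq - lastSeq;
  }
  return result;
}

// Hypothetical numbers, not taken from the thread.
const pending = pendingChanges(100000, {
  "_design/customers": 0,     // never built: sees every change
  "_design/orders": 91000,    // built earlier: only the newer changes
});
console.log(pending);
```

Under this model, two indexer tasks started at the same moment can report different totals simply because their indexes were last brought up to date at different points.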

Re: How does indexing really work?

Posted by Sebastian Rothbucher <se...@googlemail.com>.

Hi Tito,

My Futon says something like "Processed 0 of 4 changes (0%)", and the number of
changes might differ from the number of docs (at least when the indexing is
started several times). Likewise, the Futon start page displays the number of
docs and the update seq (= number of changes). Could that be one explanation?
Because I'd be with you: a new view (or any change to a design doc, for that
matter) means working through all documents of the DB (all N, as you put it),
passing each to the view server and producing the view. The views remember
which document a non-reduced row belongs to, but as emit can happen zero,
one, or several times per document, the number of rows is not really an
indication of the number of documents.

I'm afraid I don't have the cookie cutter - but let us know

Best
     Sebastian
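
The point about emit can be sketched as follows. This is a stand-in for the view server, not CouchDB's actual implementation: emit is passed in as a parameter here, whereas in a real map function it is a global. The byTag function and the sample documents are hypothetical.

```javascript
// Stand-in for the view server: run a map function over every document
// and collect whatever it emits, remembering the source document's id.
function buildView(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Hypothetical map function: one row per tag, so a single document may
// emit zero, one, or several rows.
function byTag(doc, emit) {
  (doc.tags || []).forEach((tag) => emit(tag, null));
}

const docs = [
  { _id: "a", tags: [] },
  { _id: "b", tags: ["x"] },
  { _id: "c", tags: ["x", "y", "z"] },
];
const rows = buildView(byTag, docs);
console.log(docs.length, "documents ->", rows.length, "rows");
```

Every document is still passed to the map function once, whatever number of rows comes out, which is why row counts and document counts should not be expected to agree.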

On Tue, Oct 28, 2014 at 6:56 PM, Tito Ciuro <tc...@mac.com> wrote:

> Hello Joan,
>
> I’m getting this information in two places:
>
> - Futon’s “Status” page
> - CouchDB’s /_active_tasks payload
>
> I know there are ~950,000 documents in the database. This number appears
> on the /_utils main page. What I don’t understand is why the total number
> of documents differs between the active tasks reported via the Status page
> or /_active_tasks. Each active task has a different total number of docs to
> be processed.
>
> Yes, Genesis case is the initial case where CouchDB hasn’t had the
> opportunity to index anything.
>
> Thanks,
>
> — Tito
>
> > On Oct 28, 2014, at 10:50 AM, Joan Touzet <wo...@apache.org> wrote:
> >
> > Hi Tito,
> >
> > Can you explain where you're getting the "total count" from? Is this the
> > total number of rows emitted by each view after all views have finished
> > processing?
> >
> > What do you mean by "Genesis case" - do you mean building a view for the
> > first time?
> >
> > Thanks,
> > Joan

Re: How does indexing really work?

Posted by Tito Ciuro <tc...@mac.com>.

Hello Joan,

I’m getting this information in two places:

- Futon’s “Status” page
- CouchDB’s /_active_tasks payload

I know there are ~950,000 documents in the database. This number appears on the /_utils main page. What I don’t understand is why the total number of documents differs between the active tasks reported via the Status page or /_active_tasks. Each active task has a different total number of docs to be processed.

Yes, Genesis case is the initial case where CouchDB hasn’t had the opportunity to index anything.

Thanks,

— Tito
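
For comparing the per-task totals mentioned above, a small sketch like the following can summarize an /_active_tasks payload. The sample array is made up, and the field names (type, design_document, changes_done, total_changes) are assumed from what CouchDB 1.x reports for indexer tasks, so check them against the actual response. The indexerProgress helper is hypothetical.

```javascript
// Sample payload shaped like an /_active_tasks response. The values are
// invented and the field names are an assumption about indexer tasks.
const activeTasks = [
  { type: "indexer", design_document: "_design/customers", changes_done: 4750, total_changes: 950000 },
  { type: "indexer", design_document: "_design/orders", changes_done: 45000, total_changes: 450000 },
  { type: "replication" },
];

// Keep only indexer tasks and report how much work each has left.
function indexerProgress(tasks) {
  return tasks
    .filter((t) => t.type === "indexer")
    .map((t) => ({
      designDocument: t.design_document,
      remaining: t.total_changes - t.changes_done,
      percent: Math.floor((100 * t.changes_done) / t.total_changes),
    }));
}

const summary = indexerProgress(activeTasks);
console.log(summary);
```

Note that the totals here are counts of changes to process for each design document, which is a different quantity from the document count shown on the /_utils main page.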

Re: How does indexing really work?

Posted by Joan Touzet <wo...@apache.org>.

Hi Tito,

Can you explain where you're getting the "total count" from? Is this the total number of rows emitted by each view after all views have finished processing?

What do you mean by "Genesis case" - do you mean building a view for the first time?

Thanks,
Joan
