Posted to user@couchdb.apache.org by Nicolas Jessus <ni...@lores.org> on 2010/11/17 14:00:56 UTC

Forcing document reindex

Is there a way to force a document reindex, like the one triggered by an update,
without actually doing the update?

I am trying to do unspeakable things to Couch, but ones that would be very
useful (and inefficient, but that would be ok). I'll tell if it works :)

Thanks,
Nic

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

Ryan Ramage <ry...@...> writes:

> 
> Thinking about this, you don't have to get all the database. If you
> have the set of keys you are interested in for all the docs eg
> [clientName, dateMP1, dateM1]
> 

Thank you for your kind heart and desire to help, Ryan; I'm however afraid you
misunderstood the problem - which is that constructing this key doesn't look
possible.

Re: Forcing document reindex

Posted by Ryan Ramage <ry...@gmail.com>.

Thinking about this, you don't have to get all the database. If you
have the set of keys you are interested in for all the docs eg
[clientName, dateMP1, dateM1]

1. create a design doc '_design/report'

2. create a view covering all the types you want, eg 'by_id', that does
    emit(doc._id, doc)

3. create a list function in the same design doc as above. This list
function will be responsible for joining the group of things together
in your report. eg 'meeting_average_time'

4. To query the list function so that you only get the keys you are
interested in, you will need to do a POST to an example url:

http://192.168.1.101/test/_design/report/_list/meeting_average_time/by_id
body: {"keys": ["John_Smith", "MP1","M1"]}

(see http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options)
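
The list function in step 3 might look roughly like the sketch below. It assumes the document shapes Nicolas gives elsewhere in the thread (type, date, meetingProposalID) and that 'by_id' emits the whole doc as the value. getRow() and send() are the calls CouchDB's JavaScript query server provides to list functions; runList is a hypothetical stand-in so the sketch can be run outside CouchDB.

```javascript
// Sketch of a list function body. In the design doc this would be stored as a
// string under lists.meeting_average_time.
function meetingAverageTime(head, req) {
  var proposals = {};   // MeetingProposal _id -> date string
  var meetings = [];    // Meeting docs among the requested keys
  var row;
  while ((row = getRow())) {
    var doc = row.value;              // assumes by_id does emit(doc._id, doc)
    if (doc.type === "MeetingProposal") proposals[doc._id] = doc.date;
    if (doc.type === "Meeting") meetings.push(doc);
  }
  var totalDays = 0, count = 0;
  meetings.forEach(function (m) {
    var proposed = proposals[m.meetingProposalID];
    if (!proposed) return;            // proposal was not among the requested keys
    totalDays += (Date.parse(m.date) - Date.parse(proposed)) / 86400000;
    count += 1;
  });
  send(JSON.stringify({
    meetings: count,
    averageDays: count ? totalDays / count : null
  }));
}

// Hypothetical local stand-ins for getRow() and send(), only for trying the
// function outside CouchDB.
function runList(listFn, rows) {
  var i = 0, out = "";
  globalThis.getRow = function () { return i < rows.length ? rows[i++] : null; };
  globalThis.send = function (chunk) { out += chunk; };
  listFn({}, { query: {} });
  return out;
}
```

The list function only ever sees the rows selected by the posted keys, so every document needed for the join has to be named in "keys" up front.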

On Wed, Nov 17, 2010 at 11:45 AM, Nicolas Jessus
<ni...@lores.org> wrote:
>> What about a list function?
> I really don't see how that would work, except by getting most of the database
> to be returned to the list function and doing set manipulation there...
>
>

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

> What about a list function?
I really don't see how that would work, except by getting most of the database
to be returned to the list function and doing set manipulation there...

Re: Forcing document reindex

Posted by Ryan Ramage <ry...@gmail.com>.

What about a list function?

You can access the request to get query parameters (eg id needed for
the report?) and then you can iterate through the docs and build up
your relationship.

I don't know how fast it will be with the number of docs you have, but
it will be linear time (order n).

http://guide.couchdb.org/editions/1/en/lists.html


On Wed, Nov 17, 2010 at 9:13 AM, Nicolas Jessus
<ni...@lores.org> wrote:
> All right; no one should like what they're going to read.
>
> I have a medium-sized MySQL system, which translates to a Couch with about a
> million documents of about 20 types. The system would really benefit from a
> schema-free design. The data is only weakly relational. Couch would fit really
> well, enough that I don't mind twisting its arm in a few places if need be; the
> tradeoff would be worth it.
>
> The hiccup is reporting. Some of it involves the full set of documents. Let's
> say I have 5 categories of documents involved in a report, A to E. A links to B,
> B links to C, etc. The report needs data from A, B, and E. As far as I can
> think, there's no way to do a view collation, because A and B share an ID but E
> doesn't. I can't pull a million documents from the DB to process elsewhere
> either, so that nixes simple indexing and the '_id' object values.
>
> I could however write a special view_server that will emit keys after checking
> the linked ID through an HTTP call (that's where you scream). Indexing
> performance is totally unimportant to me, DB updates are relatively few, and I
> can live with the dirty side-effects (again, the system as a whole would still
> be much cleaner than the MySQL one).
>
> With that solution I can have a map function that just handles docs of type A.
> But I still need to reindex the relevant As when B or E changes. I could simply
> listen to the change stream and force a reindex, but that doesn't work well with
> legitimate updates when the _rev number goes up at random even though the doc
> hasn't changed, and there's no auto-merge. So I'm pretty stuck.
>
> I'm not asking that this type of functionality be encouraged. It's clearly
> subverting the point of Couch. On the other hand, it doesn't seem like having a
> force-reindex function would dirty the concept, and if it's easy to code, then
> it's a shame it doesn't exist.
>
>

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

> you can only emit a result from the map function if key elements to be used
> are in the document currently being indexed.

Well, yes, that's the problem I'm trying to get around somehow.

> have to use a list or one of the lucene projects to combine views together.
> or figure out a way to put all the data into denormalised document, which
> you have said isn't going to work.

Are there Lucene projects with the specific goal of combining views together? If
so that could be interesting. All I know is the rnewsom adapter, which creates
its own index.

> regarding the re-index concern of yours, if the data changes couchdb takes
> care of the views/lists, you don't have to force it to do anything - what am
> I missing?

Ah, if I implement the hack of getting the parts of the key I don't have through
an HTTP call, then I can have a map only caring about Meeting documents; but in
that case I will need to force one or more Meeting docs to reindex when the
related MeetingProposal or Client changes. Clearer? :)
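
To make the dependency concrete, here is a rough sketch of working out which Meeting docs are affected when some other doc changes. It assumes the link fields from the sample documents elsewhere in this thread; chainIds and affectedMeetings are made-up helper names, and the full scan over an in-memory array is only for illustration.

```javascript
// Link fields taken from the sample documents in this thread.
var LINK_FIELDS = ["meetingProposalID", "projectPartID", "projectID", "clientID"];

// Collects the _ids a Meeting's report key depends on, itself included.
function chainIds(meeting, byId) {
  var ids = [meeting._id];
  var doc = meeting;
  while (doc) {
    var next = null;
    LINK_FIELDS.forEach(function (field) {
      if (!next && doc[field]) next = byId[doc[field]] || null;
    });
    if (!next || ids.indexOf(next._id) !== -1) break;   // end of chain, or a cycle
    ids.push(next._id);
    doc = next;
  }
  return ids;
}

// Returns the _ids of the Meeting docs that would need re-indexing after the
// document with _id changedId changes.
function affectedMeetings(changedId, docs) {
  var byId = {};
  docs.forEach(function (d) { byId[d._id] = d; });
  return docs
    .filter(function (d) { return d.type === "Meeting"; })
    .filter(function (m) { return chainIds(m, byId).indexOf(changedId) !== -1; })
    .map(function (m) { return m._id; });
}
```

A change to a Client fans out to every Meeting below it, which is why one changed document can mean many reindexed ones.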

Thanks much,
Nic

Re: Forcing document reindex

Posted by Nicholas Orr <ni...@zxgen.net>.

On Thu, Nov 18, 2010 at 5:00 AM, Nicolas Jessus <ni...@lores.org> wrote:

>
> Naively, the key should be something like [clientName, dateMP1, dateM1], or
> maybe [clientName] and a value of [dateMP1, dateM1]. There can be hundreds
> of
> thousands of meetings. The problem is to generate the key triplet when
> there's
> no common ID between the documents.

you can only emit a result from the map function if key elements to be used
are in the document currently being indexed.
if you want [clientName] then the field clientName has to exist in the
document.
if you want [dateMP1, dateM1] then those fields have to exist in the
document.

example, car data - this is what i currently deal with :)

document{
make: honda
family: accord
variant: euro
}

document{
make: honda
family: accord
variant: v6
}

document{
make: toyota
family: camry
variant: something
}

then i can create a view to emit all makes
function(doc) {
  if(doc.make) {
    emit(doc.make,1)
  }
}

then I'll get a view with results like
{rows: [
  {key:honda, value:1},
  {key:honda, value:1},
  {key:toyota, value:1}
]}

put in a reduce function and i get totals (using group=true)
function(keys, values) {
  return sum(values);
}
result
{rows: [
  {key:honda, value:2},
  {key:toyota, value:1}
]}

now i want to know how many of each make & family, so i update the view map
function
function(doc) {
  if(doc.make && doc.family) {
    emit([doc.make, doc.family], 1)
  }
}

result (using group=true)
{rows: [
  {key:[honda,accord], value:2},
  {key:[toyota,camry], value:1}
]}

result (using group=true&group_level=1)
{rows: [
  {key:[honda], value:2},
  {key:[toyota], value:1}
]}
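
The grouping above can be imitated in plain JavaScript, which may help when checking what a given group_level should return. This is only an in-memory sketch: queryGrouped is a made-up helper, emit is passed in as a parameter rather than being the CouchDB global, and sorting the JSON-encoded keys is a simplification of CouchDB's real collation order.

```javascript
// The three car documents from the example above.
var docs = [
  { make: "honda", family: "accord", variant: "euro" },
  { make: "honda", family: "accord", variant: "v6" },
  { make: "toyota", family: "camry", variant: "something" }
];

// Same map logic as the view, with emit() passed in so it runs outside CouchDB.
function map(doc, emit) {
  if (doc.make && doc.family) {
    emit([doc.make, doc.family], 1);
  }
}

// Same reduce logic as the view, with CouchDB's sum() helper written out.
function reduce(keys, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Rough imitation of querying the view with group_level=groupLevel.
function queryGrouped(docs, groupLevel) {
  var rows = [];
  docs.forEach(function (doc) {
    map(doc, function (key, value) { rows.push({ key: key, value: value }); });
  });
  var groups = {};   // JSON-encoded key prefix -> list of values
  var order = [];
  rows.forEach(function (row) {
    var prefix = JSON.stringify(row.key.slice(0, groupLevel));
    if (!groups[prefix]) { groups[prefix] = []; order.push(prefix); }
    groups[prefix].push(row.value);
  });
  return order.sort().map(function (prefix) {
    return { key: JSON.parse(prefix), value: reduce(null, groups[prefix]) };
  });
}

console.log(JSON.stringify(queryGrouped(docs, 2)));
console.log(JSON.stringify(queryGrouped(docs, 1)));
```

Every part of the key still comes from the single document passed to map, which is the limitation discussed next.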

so looking at what you are trying to do I don't see how it is possible to
have separate documents emitting keys from other documents (a single document
goes into the map function) using map/reduce - you are probably going to
have to use a list or one of the lucene projects to combine views together.
or figure out a way to put all the data into denormalised document, which
you have said isn't going to work.

regarding the re-index concern of yours, if the data changes couchdb takes
care of the views/lists, you don't have to force it to do anything - what am
I missing?

Nicholas :)

(hopefully i got all that right....)

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

Hello Cliff,

> I am not sure if I fully understand your use case (however it does sound 
> intriguing and unusual).

Sorry, I'll try to be clearer. I should have taken a real case to start with, I
just didn't want to be unnecessarily verbose (failed!).

Consider 5 types of documents:

type: Meeting
_id: M1
meetingProposalID: MP1
date: 2010-09-09

type: MeetingProposal
_id: MP1
projectPartID: PP1
date: 2010-10-10

type: ProjectPart
_id: PP1
projectID: P1

type: Project
_id: P1
clientID: C1

type: Client
_id: C1
name: John

ProjectPart can be denormalised into Project, but let's ignore that.

Let's say I would like to know the average time between a meeting proposal and
the actual meeting, per client, to see what kind of delay I should expect. This 
is a simple report, others are much more complex, so I'm really looking to solve
the general case problem.

Naively, the key should be something like [clientName, dateMP1, dateM1], or
maybe [clientName] and a value of [dateMP1, dateM1]. There can be hundreds of
thousands of meetings. The problem is to generate the key triplet when there's
no common ID between the documents.
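
As an illustration of what building that triplet involves, the sketch below walks the chain from a Meeting up to its Client using the five sample documents. lookup(id) is a hypothetical fetch-by-_id helper; a normal CouchDB map function has nothing like it, since it only sees the one document being indexed, and that is exactly the gap.

```javascript
// The five sample documents from above, indexed by _id.
var db = {
  M1:  { _id: "M1",  type: "Meeting", meetingProposalID: "MP1", date: "2010-09-09" },
  MP1: { _id: "MP1", type: "MeetingProposal", projectPartID: "PP1", date: "2010-10-10" },
  PP1: { _id: "PP1", type: "ProjectPart", projectID: "P1" },
  P1:  { _id: "P1",  type: "Project", clientID: "C1" },
  C1:  { _id: "C1",  type: "Client", name: "John" }
};

// Follows Meeting -> MeetingProposal -> ProjectPart -> Project -> Client and
// returns [clientName, proposalDate, meetingDate], or null if a link is missing.
function reportKey(meeting, lookup) {
  var proposal = lookup(meeting.meetingProposalID);
  var part = proposal && lookup(proposal.projectPartID);
  var project = part && lookup(part.projectID);
  var client = project && lookup(project.clientID);
  if (!client) return null;   // broken chain: nothing to emit
  return [client.name, proposal.date, meeting.date];
}

var key = reportKey(db.M1, function (id) { return db[id]; });
console.log(JSON.stringify(key));   // ["John","2010-10-10","2010-09-09"]
```

Four lookups per Meeting is what the HTTP-calling view server would have to do for every document it indexes.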


> I assume that you are getting data out of your legacy MySQL system using 
> complex joins.??
Yes, although the joins aren't complex, the data model is pretty
straightforward, with docs mostly in a chain. 

> Have you considered totally denormalising your data and input data to 
> couchdb based on the output of your MySQL reports ??
Yes, but that would not really work - each document can still be updated on its
own, with maybe a few thousand updates a day, which is little but enough to
cause massive locks if there are massively denormalised documents.

> Perhaps couchdb-lucene (or my current fav of the moment elasticsearch 
> which is also based on lucene) would be useful ??
I already have set it up, and modified it to make simple doc joins without fuss,
which is good enough for run-of-the-mill searching. It wouldn't resolve the
million-doc-pull problem, though, and joining is obviously pretty slow.


But thanks for the proposals :)

Re: Forcing document reindex

Posted by Cliff Williams <cl...@aol.com>.

Nicolas,

I am not sure if I fully understand your use case (however it does sound 
intriguing and unusual).

A couple of things stick out in your commentary;

"The data is only weakly relational."
"DB updates are relatively few"

I assume that you are getting data out of your legacy MySQL system using 
complex joins.??

Have you considered totally denormalising your data and input data to 
couchdb based on the output of your MySQL reports ??
Perhaps couchdb-lucene (or my current fav of the moment elasticsearch 
which is also based on lucene) would be useful ??

If neither of the two suggestions is of any use, could you post a more
detailed description (with a data sample if possible) of

"The hiccup is reporting. Some of it involves the full set of documents. 
Let's
say I have 5 categories of documents involved in a report, A to E. A 
links to B,
B links to C, etc. The report needs data from A, B, and E. As far as I can
think, there's no way to do a view collation, because A and B share an 
ID but E
doesn't. I can't pull a million documents from the DB to process elsewhere
either, so that nixes simple indexing and the '_id' object values."

Very best regards

Cliff

On 17/11/10 16:13, Nicolas Jessus wrote:
> All right; no one should like what they're going to read.
>
> I have a medium-sized MySQL system, which translates to a Couch with about a
> million documents of about 20 types. The system would really benefit from a
> schema-free design. The data is only weakly relational. Couch would fit really
> well, enough that I don't mind twisting its arm in a few places if need be; the
> tradeoff would be worth it.
>
> The hiccup is reporting. Some of it involves the full set of documents. Let's
> say I have 5 categories of documents involved in a report, A to E. A links to B,
> B links to C, etc. The report needs data from A, B, and E. As far as I can
> think, there's no way to do a view collation, because A and B share an ID but E
> doesn't. I can't pull a million documents from the DB to process elsewhere
> either, so that nixes simple indexing and the '_id' object values.
>
> I could however write a special view_server that will emit keys after checking
> the linked ID through an HTTP call (that's where you scream). Indexing
> performance is totally unimportant to me, DB updates are relatively few, and I
> can live with the dirty side-effects (again, the system as a whole would still
> be much cleaner than the MySQL one).
>
> With that solution I can have a map function that just handles docs of type A.
> But I still need to reindex the relevant As when B or E changes. I could simply
> listen to the change stream and force a reindex, but that doesn't work well with
> legitimate updates when the _rev number goes up at random even though the doc
> hasn't changed, and there's no auto-merge. So I'm pretty stuck.
>
> I'm not asking that this type of functionality be encouraged. It's clearly
> subverting the point of Couch. On the other hand, it doesn't seem like having a
> force-reindex function would dirty the concept, and if it's easy to code, then
> it's a shame it doesn't exist.
>
>

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

All right; no one should like what they're going to read.

I have a medium-sized MySQL system, which translates to a Couch with about a
million documents of about 20 types. The system would really benefit from a
schema-free design. The data is only weakly relational. Couch would fit really
well, enough that I don't mind twisting its arm in a few places if need be; the
tradeoff would be worth it. 

The hiccup is reporting. Some of it involves the full set of documents. Let's
say I have 5 categories of documents involved in a report, A to E. A links to B,
B links to C, etc. The report needs data from A, B, and E. As far as I can
think, there's no way to do a view collation, because A and B share an ID but E
doesn't. I can't pull a million documents from the DB to process elsewhere
either, so that nixes simple indexing and the '_id' object values. 

I could however write a special view_server that will emit keys after checking
the linked ID through an HTTP call (that's where you scream). Indexing
performance is totally unimportant to me, DB updates are relatively few, and I
can live with the dirty side-effects (again, the system as a whole would still
be much cleaner than the MySQL one). 

With that solution I can have a map function that just handles docs of type A.
But I still need to reindex the relevant As when B or E changes. I could simply
listen to the change stream and force a reindex, but that doesn't work well with
legitimate updates when the _rev number goes up at random even though the doc
hasn't changed, and there's no auto-merge. So I'm pretty stuck.

I'm not asking that this type of functionality be encouraged. It's clearly
subverting the point of Couch. On the other hand, it doesn't seem like having a
force-reindex function would dirty the concept, and if it's easy to code, then
it's a shame it doesn't exist.

Re: Forcing document reindex

Posted by Jan Lehnardt <ja...@apache.org>.

On 17 Nov 2010, at 15:16, Nicolas Jessus wrote:

>> No. But storing the document with no changes will do the trick. Of course,
>> CouchDB will update the _rev field in the process.
> 
> That would probably cause concurrency problems with legitimate updates. Shame.
> That would be very useful functionality.

Why?

Cheers
Jan
--

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

> No. But storing the document with no changes will do the trick. Of course,
> CouchDB will update the _rev field in the process.

That would probably cause concurrency problems with legitimate updates. Shame.
That would be very useful functionality.

Thanks for the help, anyway
Nic

Re: Forcing document reindex

Posted by Jan Lehnardt <ja...@apache.org>.

On 17 Nov 2010, at 14:34, Nicolas Jessus wrote:

>> If you make a change to your view functions (a whitespace change will do)
>> and query the view again, it'll be rebuilt.
>> 
>> Please don't abuse your couch too much :)
>> 
>> Cheers
>> Jan
> 
> 
> Hello Jan,
> Thank you for the answer. I'd just like to avoid rebuilding entire views, and
> have the benefit of reindexing the document in whatever view it is needed. 
> So it's pretty much exactly the behavior of updating a document, but without
> actually incrementing its _rev number. Can Couch do that?

No. But storing the document with no changes will do the trick. Of course, CouchDB will update the _rev field in the process.
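
A minimal sketch of such a no-change save over the HTTP API, assuming a GET followed by a PUT of the same body. touchDocument is a made-up name, baseUrl and docId are placeholders, and fetchImpl defaults to the global fetch (Node 18+ or a browser). The 409 branch covers the case where another writer saved the document in between.

```javascript
// Re-saves a document unchanged so CouchDB runs it through the views again.
// baseUrl is the database URL, e.g. "http://localhost:5984/test" (placeholder).
async function touchDocument(baseUrl, docId, fetchImpl) {
  var doFetch = fetchImpl || fetch;
  var url = baseUrl + "/" + encodeURIComponent(docId);
  var current = await (await doFetch(url)).json();   // includes _id and _rev
  var res = await doFetch(url, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(current)                    // same body, same _rev
  });
  if (res.status === 409) {
    // Another writer saved the doc between our GET and PUT. Their write
    // already produced a new revision, so the views will see the doc anyway.
    return { touched: false, reason: "conflict" };
  }
  return { touched: res.ok, body: await res.json() };
}
```

The new _rev also means that any client still holding the old revision gets a 409 on its next save and has to re-read the document first.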

Cheers
Jan
--

Re: Forcing document reindex

Posted by Nicolas Jessus <ni...@lores.org>.

> If you make a change to your view functions (a whitespace change will do)
> and query the view again, it'll be rebuilt.
> 
> Please don't abuse your couch too much :)
> 
> Cheers
> Jan


Hello Jan,
Thank you for the answer. I'd just like to avoid rebuilding entire views, and
have the benefit of reindexing the document in whatever view it is needed. 
So it's pretty much exactly the behavior of updating a document, but without
actually incrementing its _rev number. Can Couch do that?

I promise I'll be gentle :)

Nic

Re: Forcing document reindex

Posted by Jan Lehnardt <ja...@apache.org>.

Hi Nicolas,

On 17 Nov 2010, at 14:00, Nicolas Jessus wrote:

> Is there a way to force a document reindex, like the one triggered by an update,
> without actually doing the update?
> 
> I am trying to do unspeakable things to Couch, but ones that would be very
> useful (and inefficient, but that would be ok). I'll tell if it works :)

If you make a change to your view functions (a whitespace change will do)
and query the view again, it'll be rebuilt.

Please don't abuse your couch too much :)

Cheers
Jan
--