
Posted to user@couchdb.apache.org by Harold Cooper <ha...@mit.edu> on 2010/02/08 00:15:40 UTC

two view questions: group=true, inverted indices

Hi there,

I'm new to CouchDB and have two questions about the use of mapreduce
in views.

1.
As far as I can tell, even when I pass group=true to a view,
reduce(keys, values) is still passed different keys,
e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
["b", "94d13f9e969786c6d653555a7e94f61e"]].

Isn't the whole point of group=true that this shouldn't happen?


2.
When I've read about mapreduce before, a classic example use is
constructing an inverted index. But if I make a view like:
{
map: "function(doc) {
  var words = doc.text.split(' ');
  for (var i in words) {
    emit(words[i], [doc._id]);
  }
}",
reduce: "function(keys, values) {
  // concatenate the lists of docIds together:
  return Array.prototype.concat.apply([], values);
}"
}
then couchdb complains that the reduce result is growing too fast.

I did read that this is the way things are, but it's too bad because
it would be a perfect application of mapreduce, and the only other
text search option I've heard of is couchdb-lucene which doesn't
sound nearly as fun/elegant.

Is there another way to approach this?
Should I just not reduce and end up with one row per word-occurrence?

Thanks for any help,
and sorry if this has been covered before, I did try to search around first.
--
Harold

Re: two view questions: group=true, inverted indices

Posted by Noah Slater <ns...@tumbolia.org>.

Sounds good, but I don't think you need any involvement of the committers before you start work. When I first found out about the project, I was unprepared to use it seriously unless it had a proper build system. So I built one and donated it. Nobody else was involved during its development, and once it was complete and after a while maintaining it, I was considered a committer by the rest of the team. If there is something you'd like to see in CouchDB, just start building it.

You can begin by sending patches via JIRA, and if the contribution is sustained and large enough, the community can decide to vote you in as a committer on the project. There's really no difference between you and me, except that I have write access to the main repository at the moment. You're already part of the development team, even if you didn't realise it yet. Hehe.

On 17 Feb 2010, at 16:41, Senthilkumar Peelikkampatti wrote:

> I recently asked Mark, the creator of one of the Erlang CMSs, why they
> can't use CouchDB. He said he is considering it but is sitting on the
> fence because Couch doesn't have a native Erlang interface. I may be
> wrong, but from my perspective there is no focus on such an interface.
> I am also aware of a few threads on this list asking for a native
> interface and a pure Erlang FTI (the reason being either that their
> system doesn't use Java, or that setting up, configuring, and backing
> up/restoring is not as seamless and fluid as CouchDB itself). There are
> other reasons too. If one of the committers starts the foundation work,
> people like me can jump in as and when time permits and contribute. I am
> also aware that davisp has something about FTI
> (http://www.davispj.com/2008/09/25/introducing-efti.html), but I was
> not able to locate it on GitHub. Joe Armstrong has foundation code
> (indexer, Porter stemmer, etc.) in his GitHub repository. So we have
> pieces of code available and need to integrate them with CouchDB in
> the CouchDB way. That's why I was talking about the committers'
> commitment to FTI.
> 
> 
> On Wed, Feb 17, 2010 at 8:20 AM, Noah Slater <ns...@tumbolia.org> wrote:
>> 
>> On 17 Feb 2010, at 02:30, Senthilkumar Peelikkampatti wrote:
>> 
>>> I think couchdb committers should support and encourage this kind of initiative.
>> 
>> How?
> 
> 
> 
> -- 
> Regards,
> Senthilkumar Peelikkampatti,
> http://pmsenthilkumar.blogspot.com/

Re: two view questions: group=true, inverted indices

Posted by Senthilkumar Peelikkampatti <se...@gmail.com>.

I recently asked Mark, the creator of one of the Erlang CMSs, why they
can't use CouchDB. He said he is considering it but is sitting on the
fence because Couch doesn't have a native Erlang interface. I may be
wrong, but from my perspective there is no focus on such an interface.
I am also aware of a few threads on this list asking for a native
interface and a pure Erlang FTI (the reason being either that their
system doesn't use Java, or that setting up, configuring, and backing
up/restoring is not as seamless and fluid as CouchDB itself). There are
other reasons too. If one of the committers starts the foundation work,
people like me can jump in as and when time permits and contribute. I am
also aware that davisp has something about FTI
(http://www.davispj.com/2008/09/25/introducing-efti.html), but I was
not able to locate it on GitHub. Joe Armstrong has foundation code
(indexer, Porter stemmer, etc.) in his GitHub repository. So we have
pieces of code available and need to integrate them with CouchDB in
the CouchDB way. That's why I was talking about the committers'
commitment to FTI.

On Wed, Feb 17, 2010 at 8:20 AM, Noah Slater <ns...@tumbolia.org> wrote:
>
> On 17 Feb 2010, at 02:30, Senthilkumar Peelikkampatti wrote:
>
>> I think couchdb committers should support and encourage this kind of initiative.
>
> How?

-- 
Regards,
Senthilkumar Peelikkampatti,
http://pmsenthilkumar.blogspot.com/

Re: two view questions: group=true, inverted indices

Posted by Noah Slater <ns...@tumbolia.org>.

On 17 Feb 2010, at 02:30, Senthilkumar Peelikkampatti wrote:

> I think couchdb committers should support and encourage this kind of initiative.

How?

Re: two view questions: group=true, inverted indices

Posted by Senthilkumar Peelikkampatti <se...@gmail.com>.

CouchDB needs a native FTI, and Erlang has a few options available in
that space. I am aware of an experiment going on at
http://github.com/bdionne/indexer. I think the CouchDB committers should
support and encourage this kind of initiative.

On Mon, Feb 8, 2010 at 5:31 AM, Robert Dionne
<di...@dionne-associates.com> wrote:
>
>
>
> On Feb 7, 2010, at 6:29 PM, Paul Davis wrote:
>
>> On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <ha...@mit.edu> wrote:
>>> Hi there,
>>>
>>> I'm new to CouchDB and have two questions about the use of mapreduce
>>> in views.
>>>
>>> 1.
>>> As far as I can tell, even when I pass group=true to a view,
>>> reduce(keys, values) is still passed different keys,
>>> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
>>> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>>>
>>
>> Even when you query with group=true, the ungrouped reduction is still
>> calculated. Generally you should be able to just ignore such things.
>>
>>> Isn't the whole point of group=true that this shouldn't happen?
>>>
>>>
>>> 2.
>>> When I've read about mapreduce before, a classic example use is
>>> constructing an inverted index. But if I make a view like:
>>> {
>>> map: "function(doc) {
>>>  var words = doc.text.split(' ');
>>>  for (var i in words) {
>>>    emit(words[i], [doc._id]);
>>>  }
>>> }",
>>> reduce: "function(keys, values) {
>>>  // concatenate the lists of docIds together:
>>>  return Array.prototype.concat.apply([], values);
>>> }"
>>> }
>>> then couchdb complains that the reduce result is growing too fast.
>>>
>>> I did read that this is the way things are, but it's too bad because
>>> it would be a perfect application of mapreduce, and the only other
>>> text search option I've heard of is couchdb-lucene which doesn't
>>> sound nearly as fun/elegant.
>>>
>>> Is there another way to approach this?
>>> Should I just not reduce and end up with one row per word-occurrence?
>>
>> CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
>> the old school map/reduce pattern that expects to be calculating a
>> single reduction value. The CouchDB internals make doing things like
>> inverted indices hard. The 'proper' way would be to do as you say and
>> return a single row per key with only some intermediary values handled
>> by reductions.
>>
>> Also, while couchdb-lucene may not be nearly as much fun, it's got
>> quite a bit to it. Full-Text indexing is hard. Many examples show it
>> as nothing more than an inverted index, but that's hiding 95% of the
>> knowledge on information retrieval and scoring algorithms that are in
>> Lucene. And there's the integration with Tika to do things like
>> attachment indexing. I quite dislike Java but I've come to accept that
>> there really isn't much competition that's compatible with CouchDB's
>> document model.
>>
>
> I think it does have challenges, and couchdb-lucene offers a good solution for most use cases; plus it's mature and well known. But
> at some point, perhaps post-1.0, I think a native FTI implementation will add a lot of value to CouchDB, if only by removing the dependency
> on Java.
>
>
>
>
>
>
>> HTH,
>> Paul Davis
>>
>>> Thanks for any help,
>>> and sorry if this has been covered before, I did try to search around first.
>>> --
>>> Harold
>>>
>
>



-- 
Regards,
Senthilkumar Peelikkampatti,
http://pmsenthilkumar.blogspot.com/

Re: two view questions: group=true, inverted indices

Posted by Robert Dionne <di...@dionne-associates.com>.



On Feb 7, 2010, at 6:29 PM, Paul Davis wrote:

> On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <ha...@mit.edu> wrote:
>> Hi there,
>> 
>> I'm new to CouchDB and have two questions about the use of mapreduce
>> in views.
>> 
>> 1.
>> As far as I can tell, even when I pass group=true to a view,
>> reduce(keys, values) is still passed different keys,
>> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
>> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>> 
> 
> Even when you query with group=true, the ungrouped reduction is still
> calculated. Generally you should be able to just ignore such things.
> 
>> Isn't the whole point of group=true that this shouldn't happen?
>> 
>> 
>> 2.
>> When I've read about mapreduce before, a classic example use is
>> constructing an inverted index. But if I make a view like:
>> {
>> map: "function(doc) {
>>  var words = doc.text.split(' ');
>>  for (var i in words) {
>>    emit(words[i], [doc._id]);
>>  }
>> }",
>> reduce: "function(keys, values) {
>>  // concatenate the lists of docIds together:
>>  return Array.prototype.concat.apply([], values);
>> }"
>> }
>> then couchdb complains that the reduce result is growing too fast.
>> 
>> I did read that this is the way things are, but it's too bad because
>> it would be a perfect application of mapreduce, and the only other
>> text search option I've heard of is couchdb-lucene which doesn't
>> sound nearly as fun/elegant.
>> 
>> Is there another way to approach this?
>> Should I just not reduce and end up with one row per word-occurrence?
> 
> CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
> the old school map/reduce pattern that expects to be calculating a
> single reduction value. The CouchDB internals make doing things like
> inverted indices hard. The 'proper' way would be to do as you say and
> return a single row per key with only some intermediary values handled
> by reductions.
> 
> Also, while couchdb-lucene may not be nearly as much fun, it's got
> quite a bit to it. Full-Text indexing is hard. Many examples show it
> as nothing more than an inverted index, but that's hiding 95% of the
> knowledge on information retrieval and scoring algorithms that are in
> Lucene. And there's the integration with Tika to do things like
> attachment indexing. I quite dislike Java but I've come to accept that
> there really isn't much competition that's compatible with CouchDB's
> document model.
> 

I think it does have challenges, and couchdb-lucene offers a good solution for most use cases; plus it's mature and well known. But
at some point, perhaps post-1.0, I think a native FTI implementation will add a lot of value to CouchDB, if only by removing the dependency
on Java.






> HTH,
> Paul Davis
> 
>> Thanks for any help,
>> and sorry if this has been covered before, I did try to search around first.
>> --
>> Harold
>>

Re: two view questions: group=true, inverted indices

Posted by Paul Davis <pa...@gmail.com>.

On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <ha...@mit.edu> wrote:
> Hi there,
>
> I'm new to CouchDB and have two questions about the use of mapreduce
> in views.
>
> 1.
> As far as I can tell, even when I pass group=true to a view,
> reduce(keys, values) is still passed different keys,
> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>

Even when you query with group=true, the ungrouped reduction is still
calculated. Generally you should be able to just ignore such things.
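
To make that concrete, here is a minimal sketch, assuming the reduce(keys, values, rereduce) signature that CouchDB's JavaScript view server uses: on a first pass, keys is an array of [key, docid] pairs that may span several different keys, and on a rereduce pass keys is null and values holds earlier reduce results. The doc ids and the direct calls at the bottom are made up for illustration; in CouchDB the view engine makes these calls itself.

```javascript
// Reduce function in the shape CouchDB's JavaScript view server calls it.
// keys:   array of [key, docId] pairs, or null when rereduce is true
// values: emitted values, or earlier reduce results when rereduce is true
function countRows(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts produced by earlier reduce calls
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // values are the emitted map values; count one per row,
  // regardless of whether the rows share a key
  return values.length;
}

// Hypothetical calls, mimicking what the view engine might do internally.
// Note that the first call mixes keys "a" and "b", as in the question.
var partialA = countRows([["a", "id1"], ["b", "id2"]], [1, 1], false);
var partialB = countRows([["b", "id3"]], [1], false);
var total = countRows(null, [partialA, partialB], true);
console.log(total); // 3
```

A reduce written this way does not care which keys it is handed, which is why the mixed-key calls can usually be ignored.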

> Isn't the whole point of group=true that this shouldn't happen?
>
>
> 2.
> When I've read about mapreduce before, a classic example use is
> constructing an inverted index. But if I make a view like:
> {
> map: "function(doc) {
>  var words = doc.text.split(' ');
>  for (var i in words) {
>    emit(words[i], [doc._id]);
>  }
> }",
> reduce: "function(keys, values) {
>  // concatenate the lists of docIds together:
>  return Array.prototype.concat.apply([], values);
> }"
> }
> then couchdb complains that the reduce result is growing too fast.
>
> I did read that this is the way things are, but it's too bad because
> it would be a perfect application of mapreduce, and the only other
> text search option I've heard of is couchdb-lucene which doesn't
> sound nearly as fun/elegant.
>
> Is there another way to approach this?
> Should I just not reduce and end up with one row per word-occurrence?

CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
the old school map/reduce pattern that expects to be calculating a
single reduction value. The CouchDB internals make doing things like
inverted indices hard. The 'proper' way would be to do as you say and
return a single row per key with only some intermediary values handled
by reductions.
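
One way to read that, as a rough sketch rather than a recommendation: keep the doc ids in the map rows and let the reduce produce only a small per-word summary, here the number of documents containing each word. The emit() stand-in, currentDoc, queryGrouped() and the sample docs are hypothetical scaffolding so the sketch runs outside CouchDB; inside CouchDB, emit() is a global and grouping is done by querying with group=true. The plain JavaScript sort used here is also only an approximation of CouchDB's key collation.

```javascript
var rows = [];
var currentDoc = null;

// Stand-in for the emit() global that CouchDB provides to map functions.
function emit(key, value) {
  rows.push({ key: key, id: currentDoc._id, value: value });
}

// Map: one row per distinct word per document.
function map(doc) {
  if (typeof doc.text !== "string") { return; }
  var seen = Object.create(null);
  doc.text.split(/\s+/).forEach(function (word) {
    if (word && !seen[word]) {
      seen[word] = true;
      emit(word, 1);
    }
  });
}

// Reduce: a sum stays small and works unchanged for rereduce.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Local approximation of querying the view with group=true.
function queryGrouped(docs) {
  rows = [];
  docs.forEach(function (doc) { currentDoc = doc; map(doc); });
  var groups = Object.create(null);
  rows.forEach(function (row) {
    if (!groups[row.key]) { groups[row.key] = []; }
    groups[row.key].push(row);
  });
  return Object.keys(groups).sort().map(function (key) {
    var group = groups[key];
    var keys = group.map(function (r) { return [r.key, r.id]; });
    var values = group.map(function (r) { return r.value; });
    return { key: key, value: reduce(keys, values, false) };
  });
}

var sampleDocs = [
  { _id: "d1", text: "couch db view" },
  { _id: "d2", text: "couch view view" }
];
console.log(JSON.stringify(queryGrouped(sampleDocs)));
```

The list of doc ids for a word would then come from the unreduced rows (reduce=false) rather than from the reduce value.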

Also, while couchdb-lucene may not be nearly as much fun, it's got
quite a bit to it. Full-Text indexing is hard. Many examples show it
as nothing more than an inverted index, but that's hiding 95% of the
knowledge on information retrieval and scoring algorithms that are in
Lucene. And there's the integration with Tika to do things like
attachment indexing. I quite dislike Java but I've come to accept that
there really isn't much competition that's compatible with CouchDB's
document model.

HTH,
Paul Davis

> Thanks for any help,
> and sorry if this has been covered before, I did try to search around first.
> --
> Harold
>

Re: two view questions: group=true, inverted indices

Posted by Harold Cooper <hr...@gmail.com>.

Haha, thanks for the info. I'm sure couchdb-lucene is the best way to go for
full text search; I should've simply said that I think mapreduce can be
"fun" and "elegant" when it fits really well, but I look forward to trying
out couchdb-lucene and I expect I'll enjoy using it as well.

As for question 1, I think Paul's answer is what I was looking for, so now I
understand where those calls were coming from.

Thanks for the quick and helpful replies!
--
H


On Sun, Feb 7, 2010 at 6:30 PM, Robert Newson <ro...@gmail.com> wrote:

> 1) it's reduce(keys, values, rereduce). The method should be called
> with 1 or more values for the same key, which you can then reduce to a
> summary value. It's called 'reduce' because the result must be smaller
> than the input. Building a result as large as the input (in fact, as
> large as the sum of the inputs) isn't really what map/reduce is for.
>
> 2) In your example, just remove the reduce method altogether for a
> simplistic "lookup by word" index. If you query it with ?key=<word>
> then you'll get a lot of rows back, one per document with that word in
> it.
>
> I should defend couchdb-lucene a little on principle and just say that
> it's fun, perhaps inelegant, but actually quite fast and a more
> appropriate means to do full-text search than a couchdb view (which is
> why I wrote it).
>
> B.
>
> On Sun, Feb 7, 2010 at 11:15 PM, Harold Cooper <ha...@mit.edu> wrote:
> > Hi there,
> >
> > I'm new to CouchDB and have two questions about the use of mapreduce
> > in views.
> >
> > 1.
> > As far as I can tell, even when I pass group=true to a view,
> > reduce(keys, values) is still passed different keys,
> > e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> > ["b", "94d13f9e969786c6d653555a7e94f61e"]].
> >
> > Isn't the whole point of group=true that this shouldn't happen?
> >
> >
> > 2.
> > When I've read about mapreduce before, a classic example use is
> > constructing an inverted index. But if I make a view like:
> > {
> > map: "function(doc) {
> >  var words = doc.text.split(' ');
> >  for (var i in words) {
> >    emit(words[i], [doc._id]);
> >  }
> > }",
> > reduce: "function(keys, values) {
> >  // concatenate the lists of docIds together:
> >  return Array.prototype.concat.apply([], values);
> > }"
> > }
> > then couchdb complains that the reduce result is growing too fast.
> >
> > I did read that this is the way things are, but it's too bad because
> > it would be a perfect application of mapreduce, and the only other
> > text search option I've heard of is couchdb-lucene which doesn't
> > sound nearly as fun/elegant.
> >
> > Is there another way to approach this?
> > Should I just not reduce and end up with one row per word-occurrence?
> >
> > Thanks for any help,
> > and sorry if this has been covered before, I did try to search around first.
> > --
> > Harold
> >
>

Re: two view questions: group=true, inverted indices

Posted by Robert Newson <ro...@gmail.com>.

1) it's reduce(keys, values, rereduce). The method should be called
with 1 or more values for the same key, which you can then reduce to a
summary value. It's called 'reduce' because the result must be smaller
than the input. Building a result as large as the input (in fact, as
large as the sum of the inputs) isn't really what map/reduce is for.

2) In your example, just remove the reduce method altogether for a
simplistic "lookup by word" index. If you query it with ?key=<word>
then you'll get a lot of rows back, one per document with that word in
it.
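
A minimal sketch of point 2, assuming each document has a string "text" field as in the original question. The emit() stand-in, currentDoc, lookup() and the sample docs are hypothetical and exist only so the sketch runs outside CouchDB; lookup() approximates a view query with ?key="<word>" (the key is JSON-encoded in a real request, and CouchDB attaches the doc id to every row by itself).

```javascript
var rows = [];
var currentDoc = null;

// Stand-in for the emit() global that CouchDB provides to map functions.
function emit(key, value) {
  rows.push({ key: key, id: currentDoc._id, value: value });
}

// Map only, no reduce: one row per word occurrence.
// The value is null because the row already carries the doc id.
function map(doc) {
  if (typeof doc.text !== "string") { return; }
  doc.text.split(/\s+/).forEach(function (word) {
    if (word) { emit(word, null); }
  });
}

// Local approximation of querying the view with ?key="<word>":
// returns the ids of the rows whose key equals the word.
function lookup(docs, word) {
  rows = [];
  docs.forEach(function (doc) { currentDoc = doc; map(doc); });
  return rows
    .filter(function (row) { return row.key === word; })
    .map(function (row) { return row.id; });
}

var sampleDocs = [
  { _id: "d1", text: "couch db view" },
  { _id: "d2", text: "couch view view" }
];
console.log(lookup(sampleDocs, "view")); // [ 'd1', 'd2', 'd2' ]
```

Because this emits one row per occurrence, a document that repeats a word shows up more than once; deduplicating words inside the map function would give one row per document instead.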

I should defend couchdb-lucene a little on principle and just say that
it's fun, perhaps inelegant, but actually quite fast and a more
appropriate means to do full-text search than a couchdb view (which is
why I wrote it).

B.

On Sun, Feb 7, 2010 at 11:15 PM, Harold Cooper <ha...@mit.edu> wrote:
> Hi there,
>
> I'm new to CouchDB and have two questions about the use of mapreduce
> in views.
>
> 1.
> As far as I can tell, even when I pass group=true to a view,
> reduce(keys, values) is still passed different keys,
> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>
> Isn't the whole point of group=true that this shouldn't happen?
>
>
> 2.
> When I've read about mapreduce before, a classic example use is
> constructing an inverted index. But if I make a view like:
> {
> map: "function(doc) {
>  var words = doc.text.split(' ');
>  for (var i in words) {
>    emit(words[i], [doc._id]);
>  }
> }",
> reduce: "function(keys, values) {
>  // concatenate the lists of docIds together:
>  return Array.prototype.concat.apply([], values);
> }"
> }
> then couchdb complains that the reduce result is growing too fast.
>
> I did read that this is the way things are, but it's too bad because
> it would be a perfect application of mapreduce, and the only other
> text search option I've heard of is couchdb-lucene which doesn't
> sound nearly as fun/elegant.
>
> Is there another way to approach this?
> Should I just not reduce and end up with one row per word-occurrence?
>
> Thanks for any help,
> and sorry if this has been covered before, I did try to search around first.
> --
> Harold
>