You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by He Shiming <he...@gmail.com> on 2011/05/22 08:59:37 UTC

Mapping multiple entries in an array field? (like tags)

Dear Community,

I'm working with documents like this:
{
    "tags": ["python", "couchdb", "web"]
}

I'd like to be able to query this document, by all of the following keys:
"python"
["python", "web"]
["couchdb", "python"]
["web", "python", "couchdb"]

but not by keys such as:
"redis"
["python", "redis"]

It's a tag-like situation. I can have the parameter and the field
sorted in alphabetical order if necessary. But it should support any
number of "tags" as the key.

The only solution I figured, is to compose a list of possible key
combinations in the map function, then emit all of them. And this way,
one view will only be able to handle a fixed number of keys. For that
matter, the function gets really complicated for the 3-key (or above)
query. So I'm wondering if there are better ways to do this.

Thanks!
He Shiming

Re: Mapping multiple entries in an array field? (like tags)

Posted by He Shiming <he...@gmail.com>.
Hi Alex, thanks for the quick response. But with only a start/endkey
pair, how am I supposed to match a tag that is not at the front of an
array?

Consider these two documents:
doc1: ["akhet", "couchdb", "python", "redis", "riak"]
doc2: ["couchdb", "redis"]

They are sorted, and they can be emitted. However, specifying
startkey=["couchdb"]&endkey=["couchdb", {}] will only match doc2 (or
did I understand it wrong?). And it doesn't look like there's an easy
way to match both docs with keys like ["couchdb", "redis"].

What do you mean by another key sequence? What will it look like?

On Sun, May 22, 2011 at 4:28 PM, Alexander Shorin <kx...@gmail.com> wrote:
> Hi!
>
> I think that better would be to return sorted list of lowercase tags
> as key and access to them via startkey and endkey params.
> ?startkey=["couchdb"]&endkey=["couchdb", null]
> So you could access to all of them, but only while you're keeping
> their order in key. For other combinations (you'd like to start with
> python tag, not couchdb) you have to generate another key sequence.
>
> This would be more optimal solution instead of generation of all
> available combinations of tags: for 4 tags you would have 24 emitted
> records, for 5 - 120, for 6 - 720 and more, and this is without case
> problem.
>
> ------------------
> ,,,^..^,,,
>
>



-- 
Best regards,
He Shiming

Re: Mapping multiple entries in an array field? (like tags)

Posted by Alexander Shorin <kx...@gmail.com>.
Hi!

I think that better would be to return sorted list of lowercase tags
as key and access to them via startkey and endkey params.
?startkey=["couchdb"]&endkey=["couchdb", null]
So you could access to all of them, but only while you're keeping
their order in key. For other combinations (you'd like to start with
python tag, not couchdb) you have to generate another key sequence.

This would be more optimal solution instead of generation of all
available combinations of tags: for 4 tags you would have 24 emitted
records, for 5 - 120, for 6 - 720 and more, and this is without case
problem.

------------------
,,,^..^,,,



On Sun, May 22, 2011 at 10:59 AM, He Shiming <he...@gmail.com> wrote:
> Dear Community,
>
> I'm working with documents like this:
> {
>    "tags": ["python", "couchdb", "web"]
> }
>
> I'd like to be able to query this document, by all of the following keys:
> "python"
> ["python", "web"]
> ["couchdb", "python"]
> ["web", "python", "couchdb"]
>
> but not by keys such as:
> "redis"
> ["python", "redis"]
>
> It's a tag-like situation. I can have the parameter and the field
> sorted in alphabetical order if necessary. But it should support any
> number of "tags" as the key.
>
> The only solution I figured, is to compose a list of possible key
> combinations in the map function, then emit all of them. And this way,
> one view will only be able to handle a fixed number of keys. For that
> matter, the function gets really complicated for the 3-key (or above)
> query. So I'm wondering if there are better ways to do this.
>
> Thanks!
> He Shiming
>

Re: Mapping multiple entries in an array field? (like tags)

Posted by He Shiming <he...@gmail.com>.
10 choose 2 is actually 45. But I mis-calculated 10 choose 3, which
should be 120. The peak number for a 10-tag doc is choosing 5, which
comes to 252. And this does look scary.

Thanks for the help. I came to understand that the view of couchdb,
has to index all these combinations in a B+ tree, if they are all
emitted. I'll have to remember that.

On Tue, May 24, 2011 at 8:08 AM, Patrick Barnes <mr...@gmail.com> wrote:
> On 23/05/2011 6:08 PM, He Shiming wrote:
>>
>
> How do you get 45? '10 choose 3' is 120.
>
> Also, you don't expect to have to match a smaller number of tags? If you
> only emit 3-tag combinations, how would you match a 2-tag intersection? With
> use of startkey and endkey, you could slightly reduce the required records -
> but this would only work for the tags that always get sorted to the front.
> (eg intersection of ['a' and 'b'], not ['b' and 'f'])
>
> I'm just saying that if you have millions of documents, you'll have a view
> with more than a hundred million rows in it - so it will take a long time to
> generate, and take up a lot more disk space than the equivalent lucene view.
> At an estimate, I'd say generation speed would be much slower than lucene.
> Retrieval speed might be a little bit faster, but queries like intersections
> are something lucene is built for.
>
> -Patrick
>

-- 
Best regards,
He Shiming

Re: Mapping multiple entries in an array field? (like tags)

Posted by Patrick Barnes <mr...@gmail.com>.
On 23/05/2011 6:08 PM, He Shiming wrote:
> @Patrick. The number of combination isn't that scary, given tags are sorted.
>
> For a 2-tag query, 2-tag documents will have 1 combination only, and 3
> for 3-tag docs. 45 combinations for 10-tag docs.

How do you get 45? '10 choose 3' is 120.

Also, you don't expect to have to match a smaller number of tags? If you 
only emit 3-tag combinations, how would you match a 2-tag intersection? 
With use of startkey and endkey, you could slightly reduce the required 
records - but this would only work for the tags that always get sorted 
to the front. (eg intersection of ['a' and 'b'], not ['b' and 'f'])

> However, based on what you've said. I'm under the impression that
> calling ``emit`` this many times per doc is a bad idea. I'm not
> familiar with the underlying mechanisms of lucene. But are you saying
> emitting just several dozen times per doc will definitely be much
> slower than lucene?

I'm just saying that if you have millions of documents, you'll have a 
view with more than a hundred million rows in it - so it will take a 
long time to generate, and take up a lot more disk space than the 
equivalent lucene view.
At an estimate, I'd say generation speed would be much slower than 
lucene. Retrieval speed might be a little bit faster, but queries like 
intersections are something lucene is built for.

-Patrick

Re: Mapping multiple entries in an array field? (like tags)

Posted by He Shiming <he...@gmail.com>.
I'm experimenting with lucene, and it appears rather easy to do.
@Mark, actually it's not a bad idea to split hot and cold tags. I'll
try this approach too.

@Patrick. The number of combination isn't that scary, given tags are sorted.

For a 2-tag query, 2-tag documents will have 1 combination only, and 3
for 3-tag docs. 45 combinations for 10-tag docs.

For a 3-tag query, 10-tag docs will have 36 combinations.

Compared to just having to emit at least 10 docs for a single-tag
query on 10-tag docs, this is not so bad. Theoretically, no matter
what I do, more computing time has to be spent on those docs with more
tags.

However, based on what you've said. I'm under the impression that
calling ``emit`` this many times per doc is a bad idea. I'm not
familiar with the underlying mechanisms of lucene. But are you saying
emitting just several dozen times per doc will definitely be much
slower than lucene?

On Mon, May 23, 2011 at 12:07 PM, Mark Hahn <ma...@boutiquing.com> wrote:
> Use your original method but only with the "hot tags".  Then use my method
> for the "cold tags".  Then get the intersection.  I would think you'd be
> able to reduce the number of ids to load to log n.
>
>
>
>
> --
> Mark Hahn
> Website Manager
> mark@boutiquing.com
> 949-229-1012
>



-- 
Best regards,
He Shiming

Re: Mapping multiple entries in an array field? (like tags)

Posted by Mark Hahn <ma...@boutiquing.com>.
Use your original method but only with the "hot tags".  Then use my method
for the "cold tags".  Then get the intersection.  I would think you'd be
able to reduce the number of ids to load to log n.

On Sun, May 22, 2011 at 6:57 PM, He Shiming <he...@gmail.com> wrote:

> @Mark, hmm... eventually, I'm expecting the number of docs to be in
> the millions. Most of them will be tagged, and the number of tags will
> be in the thousands. Many "hot" tags will return a lot of documents.
> Unlike the single tag situation, which I can put a limit on. Finding
> intersections requires the full list.
>
> I'm trying to use a schedule to pre-fetch content for all the tags.
> But even under this circumstance, I'm looking for a lightweight way.
>
> So far I came to think that my original idea wasn't so bad. Because
> each document will only have like 5 or 10 tags max. The number of
> possible combinations isn't that huge. I think this way, there will be
> less calculation than fetching the full list several times at client
> side. The plus side would be, as long as it's still map/reduce, I can
> use bigcouch to scale the calculation easily.
>
> On Mon, May 23, 2011 at 9:17 AM, Mark Hahn <ma...@boutiquing.com> wrote:
> >
> > I'm using the method of emitting a result for each tag in the document
> and
> > I'm not seeing any huge client calculation.  I just get the list of doc
> ids
> > for each tag requested and do an intersection of the results.  Not a big
> > deal.  It isn't as if I have to load all the docs.
> >
> >
> >
> > --
> > Mark Hahn
> > Website Manager
> > mark@boutiquing.com
> > 949-229-1012
> >
>
>
>
> --
> Best regards,
> He Shiming
>



-- 
Mark Hahn
Website Manager
mark@boutiquing.com
949-229-1012

Re: Mapping multiple entries in an array field? (like tags)

Posted by Patrick Barnes <mr...@gmail.com>.
With your original idea; to emit for each combination of tags... you may 
still end up with an unworkable number of records in your view.

Say for each document you emit combinations of tags, in alphabetical 
order. Limiting the maximum number of matched tags to 3, for example, 
will help limit the size of the view.

5-10 tags being your typical examples, you would get:
(5 choose 1)+(5 choose 2)+(5 choose 3) = 25
to
(10 choose 1)+(10 choose 2)+(10 choose 3) = 175
view records PER document.

And what exactly do you need to reduce? Just tag and tag intersection 
counts?

---------------------------------------------------------------------

It might be better to use couchdb-lucene; you can have a view that just has:

function(doc) {
     var ret=new Document();
     for (var i in doc.tags)
        ret.add(doc.tags[i], {"index":"not_analyzed"});
     return ret;
}

Then a query like:
http://localhost:5984/db/_fti/_design/db/tags?q=tag1&include_docs=true
   gets you all the docs with tag1,

For intersections:
http://localhost:5984/db/_fti/_design/db/tags?q=tag1%20AND%20tag2&include_docs=true
   gets you all the docs with both tag1 and tag2 (AND, not and)

And, to get the number of docs that *would* be returned from any query 
without having to count them:
http://localhost:5984/db/_fti/_design/db/tags?q=tag1&limit=1
and read the 'total_rows' element out of the response.

You could even do more complex queries too, if necessary; "tag1 AND 
(tag2 OR tag3)"

The only thing I can see that you might miss from this approach is being 
able to get the list of available tags - you could combine the above 
with a view that emits single tags, has '_count' as the reduce and use 
reduce=true&group=true to fetch that primary list.

-Patrick

On 23/05/2011 11:57 AM, He Shiming wrote:
> @Mark, hmm... eventually, I'm expecting the number of docs to be in
> the millions. Most of them will be tagged, and the number of tags will
> be in the thousands. Many "hot" tags will return a lot of documents.
> Unlike the single tag situation, which I can put a limit on. Finding
> intersections requires the full list.
>
> I'm trying to use a schedule to pre-fetch content for all the tags.
> But even under this circumstance, I'm looking for a lightweight way.
>
> So far I came to think that my original idea wasn't so bad. Because
> each document will only have like 5 or 10 tags max. The number of
> possible combinations isn't that huge. I think this way, there will be
> less calculation than fetching the full list several times at client
> side. The plus side would be, as long as it's still map/reduce, I can
> use bigcouch to scale the calculation easily.
>
> On Mon, May 23, 2011 at 9:17 AM, Mark Hahn<ma...@boutiquing.com>  wrote:
>>
>> I'm using the method of emitting a result for each tag in the document and
>> I'm not seeing any huge client calculation.  I just get the list of doc ids
>> for each tag requested and do an intersection of the results.  Not a big
>> deal.  It isn't as if I have to load all the docs.
>>
>>
>>
>> --
>> Mark Hahn
>> Website Manager
>> mark@boutiquing.com
>> 949-229-1012
>>
>
>
>

Re: Mapping multiple entries in an array field? (like tags)

Posted by He Shiming <he...@gmail.com>.
@Mark, hmm... eventually, I'm expecting the number of docs to be in
the millions. Most of them will be tagged, and the number of tags will
be in the thousands. Many "hot" tags will return a lot of documents.
Unlike the single tag situation, which I can put a limit on. Finding
intersections requires the full list.

I'm trying to use a schedule to pre-fetch content for all the tags.
But even under this circumstance, I'm looking for a lightweight way.

So far I came to think that my original idea wasn't so bad. Because
each document will only have like 5 or 10 tags max. The number of
possible combinations isn't that huge. I think this way, there will be
less calculation than fetching the full list several times at client
side. The plus side would be, as long as it's still map/reduce, I can
use bigcouch to scale the calculation easily.

On Mon, May 23, 2011 at 9:17 AM, Mark Hahn <ma...@boutiquing.com> wrote:
>
> I'm using the method of emitting a result for each tag in the document and
> I'm not seeing any huge client calculation.  I just get the list of doc ids
> for each tag requested and do an intersection of the results.  Not a big
> deal.  It isn't as if I have to load all the docs.
>
>
>
> --
> Mark Hahn
> Website Manager
> mark@boutiquing.com
> 949-229-1012
>



-- 
Best regards,
He Shiming

Re: Mapping multiple entries in an array field? (like tags)

Posted by Mark Hahn <ma...@boutiquing.com>.
> However, emitting tags separately gives me separate results with each tag
as key. That way client side calculation might be huge.

I'm using the method of emitting a result for each tag in the document and
I'm not seeing any huge client calculation.  I just get the list of doc ids
for each tag requested and do an intersection of the results.  Not a big
deal.  It isn't as if I have to load all the docs.


On Sun, May 22, 2011 at 3:47 AM, He Shiming <he...@gmail.com> wrote:

> @Patrick, thanks. However, emitting tags separately gives me separate
> results with each tag as key. That way client side calculation might
> be huge. I'll have to load each of the documents to get meaning
> results. I might also want to run reduce on this query, and the
> grouping will be done for each key. It's not really what I'm
> expecting.
>
> @Nils, thanks. I'm using lucene anyway, so I'll give it a try. I'm
> just hoping that this can be done with map/reduce, I didn't expect it
> to be difficult.
>
> On Sun, May 22, 2011 at 5:15 PM, Nils Breunese <N....@vpro.nl> wrote:
> > You could get the documents for the individual tags and perform the logic
> on the client, but when that's not feasible (and it soon stops being
> feasible) you'll want to take a look at couchdb-lucene which allows you to
> do these kinds of OR queries: https://github.com/rnewson/couchdb-lucene
> >
> > Nils.
> > ________________________________________
> > Van: He Shiming [heshiming@gmail.com]
> > Verzonden: zondag 22 mei 2011 8:59
> > Aan: user@couchdb.apache.org
> > Onderwerp: Mapping multiple entries in an array field? (like tags)
> >
> > Dear Community,
> >
> > I'm working with documents like this:
> > {
> >    "tags": ["python", "couchdb", "web"]
> > }
> >
> > I'd like to be able to query this document, by all of the following keys:
> > "python"
> > ["python", "web"]
> > ["couchdb", "python"]
> > ["web", "python", "couchdb"]
> >
> > but not by keys such as:
> > "redis"
> > ["python", "redis"]
> >
> > It's a tag-like situation. I can have the parameter and the field
> > sorted in alphabetical order if necessary. But it should support any
> > number of "tags" as the key.
> >
> > The only solution I figured, is to compose a list of possible key
> > combinations in the map function, then emit all of them. And this way,
> > one view will only be able to handle a fixed number of keys. For that
> > matter, the function gets really complicated for the 3-key (or above)
> > query. So I'm wondering if there are better ways to do this.
> >
> > Thanks!
> > He Shiming
> > ------------------------------------------------------------------------
> >  VPRO   www.vpro.nl
> > ------------------------------------------------------------------------
> >
>
>
>
> --
> Best regards,
> He Shiming
>



-- 
Mark Hahn
Website Manager
mark@boutiquing.com
949-229-1012

Re: Mapping multiple entries in an array field? (like tags)

Posted by He Shiming <he...@gmail.com>.
@Patrick, thanks. However, emitting tags separately gives me separate
results with each tag as key. That way client side calculation might
be huge. I'll have to load each of the documents to get meaning
results. I might also want to run reduce on this query, and the
grouping will be done for each key. It's not really what I'm
expecting.

@Nils, thanks. I'm using lucene anyway, so I'll give it a try. I'm
just hoping that this can be done with map/reduce, I didn't expect it
to be difficult.

On Sun, May 22, 2011 at 5:15 PM, Nils Breunese <N....@vpro.nl> wrote:
> You could get the documents for the individual tags and perform the logic on the client, but when that's not feasible (and it soon stops being feasible) you'll want to take a look at couchdb-lucene which allows you to do these kinds of OR queries: https://github.com/rnewson/couchdb-lucene
>
> Nils.
> ________________________________________
> Van: He Shiming [heshiming@gmail.com]
> Verzonden: zondag 22 mei 2011 8:59
> Aan: user@couchdb.apache.org
> Onderwerp: Mapping multiple entries in an array field? (like tags)
>
> Dear Community,
>
> I'm working with documents like this:
> {
>    "tags": ["python", "couchdb", "web"]
> }
>
> I'd like to be able to query this document, by all of the following keys:
> "python"
> ["python", "web"]
> ["couchdb", "python"]
> ["web", "python", "couchdb"]
>
> but not by keys such as:
> "redis"
> ["python", "redis"]
>
> It's a tag-like situation. I can have the parameter and the field
> sorted in alphabetical order if necessary. But it should support any
> number of "tags" as the key.
>
> The only solution I figured, is to compose a list of possible key
> combinations in the map function, then emit all of them. And this way,
> one view will only be able to handle a fixed number of keys. For that
> matter, the function gets really complicated for the 3-key (or above)
> query. So I'm wondering if there are better ways to do this.
>
> Thanks!
> He Shiming
> ------------------------------------------------------------------------
>  VPRO   www.vpro.nl
> ------------------------------------------------------------------------
>



-- 
Best regards,
He Shiming

RE: Mapping multiple entries in an array field? (like tags)

Posted by Nils Breunese <N....@vpro.nl>.
You could get the documents for the individual tags and perform the logic on the client, but when that's not feasible (and it soon stops being feasible) you'll want to take a look at couchdb-lucene which allows you to do these kinds of OR queries: https://github.com/rnewson/couchdb-lucene

Nils.
________________________________________
Van: He Shiming [heshiming@gmail.com]
Verzonden: zondag 22 mei 2011 8:59
Aan: user@couchdb.apache.org
Onderwerp: Mapping multiple entries in an array field? (like tags)

Dear Community,

I'm working with documents like this:
{
    "tags": ["python", "couchdb", "web"]
}

I'd like to be able to query this document, by all of the following keys:
"python"
["python", "web"]
["couchdb", "python"]
["web", "python", "couchdb"]

but not by keys such as:
"redis"
["python", "redis"]

It's a tag-like situation. I can have the parameter and the field
sorted in alphabetical order if necessary. But it should support any
number of "tags" as the key.

The only solution I figured, is to compose a list of possible key
combinations in the map function, then emit all of them. And this way,
one view will only be able to handle a fixed number of keys. For that
matter, the function gets really complicated for the 3-key (or above)
query. So I'm wondering if there are better ways to do this.

Thanks!
He Shiming
------------------------------------------------------------------------
 VPRO   www.vpro.nl
------------------------------------------------------------------------

Re: Mapping multiple entries in an array field? (like tags)

Posted by Patrick Barnes <mr...@gmail.com>.
Hi He,

The view query api supports a 'keys' parameter.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options

So your view can emit a parameter for every tag.

function(doc) {
   for (var i in doc.tags)
     emit(doc.tags[i],null);
}

(so the view results would look like:
{key:"python",id:"doc1",value:null},
{key:"couchdb",id:"doc1",value:null},
{key:"web",id:"doc1,value:null},
...

And to fetch results for a given set of tags (say 'couchdb' and 
'python') you can call:

curl -X POST http://localhost:5984/db/ \
-H "Content-Type: application/json" -d '{"keys":["couchdb","python"]}'

Hope that helps,
-Patrick Barnes

On 22/05/2011 4:59 PM, He Shiming wrote:
> Dear Community,
>
> I'm working with documents like this:
> {
>      "tags": ["python", "couchdb", "web"]
> }
>
> I'd like to be able to query this document, by all of the following keys:
> "python"
> ["python", "web"]
> ["couchdb", "python"]
> ["web", "python", "couchdb"]
>
> but not by keys such as:
> "redis"
> ["python", "redis"]
>
> It's a tag-like situation. I can have the parameter and the field
> sorted in alphabetical order if necessary. But it should support any
> number of "tags" as the key.
>
> The only solution I figured, is to compose a list of possible key
> combinations in the map function, then emit all of them. And this way,
> one view will only be able to handle a fixed number of keys. For that
> matter, the function gets really complicated for the 3-key (or above)
> query. So I'm wondering if there are better ways to do this.
>
> Thanks!
> He Shiming
>