Posted to user@couchdb.apache.org by Rob Crowell <ro...@gmail.com> on 2011/11/30 21:10:52 UTC

Building views to locate documents WITHOUT a certain set of tags

Hey everyone, view question here.

I've got couch records that represent images.  They may have any
number of tags (from zero to hundreds).  However, while there are
thousands of tags in the dataset, there are only a couple that are
considered "bad" (BROKEN_IMAGE, BLANK_IMAGE, etc.)  Here's an example
document:

{
    _id: ...,
    url: "http://example.org/whatever.png",
    tags: ["OUTDOORS", "BEACH", "RED_DRESS"]
}

I wrote a view to emit documents that don't have these "bad" tags by
hard-coding the list of bad tags and checking every tag against this
list.  If none of the tags are bad, then emit the document.
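
A minimal sketch of what such a map function might look like. The
BAD_TAGS list, the sample documents, and the emitted key and value are
assumptions for illustration. The emit() stub exists only so the
snippet runs outside CouchDB; inside a design document, CouchDB
supplies emit() itself.

```javascript
// Stand-in for CouchDB's built-in emit(), so the sketch runs on its own.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: emit a document only if none of its tags is "bad".
function map(doc) {
  // Hypothetical hard-coded list, taken from the examples in this post.
  var BAD_TAGS = ["BROKEN_IMAGE", "BLANK_IMAGE"];
  var tags = doc.tags || [];
  for (var i = 0; i < tags.length; i++) {
    if (BAD_TAGS.indexOf(tags[i]) !== -1) {
      return; // at least one bad tag: skip this document
    }
  }
  emit(doc._id, null);
}

// Two sample documents; only the first has no bad tag.
map({ _id: "a", url: "http://example.org/whatever.png",
      tags: ["OUTDOORS", "BEACH", "RED_DRESS"] });
map({ _id: "b", url: "http://example.org/other.png",
      tags: ["BEACH", "BROKEN_IMAGE"] });
console.log(rows.map(function (r) { return r.key; }));
```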

However, a user may also specify tags that he doesn't like
(OFFENSIVE_IMAGE, DENVER_BRONCOS, whatever).  Is there any good way to
build a view around this idea ("show me all documents that don't have
a set of tags") short of defining a custom view (with their own "bad"
tags list) for every user?

I could do this filtering client-side of course, but if I wanted to
generate an exhaustive list of matching documents (for a report or
something similar) then it would be a lot of work.  I'm stumped at the
moment.  Thanks for any suggestions!

Re: Building views to locate documents WITHOUT a certain set of tags

Posted by Rob Crowell <ro...@gmail.com>.

On Wed, Nov 30, 2011 at 3:56 PM, Dave Cottlehuber <da...@muse.net.nz> wrote:
> On 30 November 2011 21:49, Rob Crowell <ro...@gmail.com> wrote:
>> I suppose it would be possible to make multiple queries, using
>> startkey and endkey to pull out the ranges.
>>
>> 1. Sort the "bad" tags: (BROKEN_IMAGE, OFFENSIVE_IMAGE)
>> 2. For each bad tag, request documents:
>>    i. Query 1:
>>        startkey = []
>>        endkey = ["BROKEN_IMAGE"]
>>
>>    ii. Query 2:
>>        startkey = ["BROKEN_IMAGE", {}]
>>        endkey = ["OFFENSIVE_IMAGE"]
>>
>>    iii. Query 3:
>>        startkey = ["OFFENSIVE_IMAGE", {}]
>>        endkey = [{}]
>>
>> Requires making N+1 queries, which for a fairly small list wouldn't be too bad.
>>
>
> foo AND bar
> NOT baz
> CONTAINS beer
>
> Classic use cases for couchdb-lucene or Elasticsearch.
>
> A+
> Dave
>

Thanks, I'll look into those.

I think the method I outlined in my second message doesn't work
anyways, since documents can have multiple tags (d'oh!).  I'd need to
get all documents, and then get the list of documents that have any of
the invalid tags (using multiple queries similar to my incorrect
solution earlier), and then write some code to remove the documents
with at least one of the bad tags from the overall list.  Yuck.
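
A rough sketch of that client-side subtraction, assuming the ids have
already been fetched. The function name excludeTagged and the shape of
its arguments (one array of ids for everything, plus one array of ids
per bad tag) are hypothetical, not part of any CouchDB API.

```javascript
// allIds: ids of every document.
// idsByBadTag: one array of ids per bad tag (the result of each query).
function excludeTagged(allIds, idsByBadTag) {
  var bad = new Set();
  // Union of all documents that carry at least one bad tag.
  idsByBadTag.forEach(function (ids) {
    ids.forEach(function (id) { bad.add(id); });
  });
  // Keep only the documents that appear in none of the bad lists.
  return allIds.filter(function (id) { return !bad.has(id); });
}

var clean = excludeTagged(
  ["a", "b", "c", "d"],
  [["b"], ["b", "d"]]   // e.g. BROKEN_IMAGE docs, OFFENSIVE_IMAGE docs
);
console.log(clean);
```

This holds every id in memory at once, which is the part that gets
painful for an exhaustive report.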

Re: Building views to locate documents WITHOUT a certain set of tags

Posted by Dave Cottlehuber <da...@muse.net.nz>.

On 30 November 2011 21:49, Rob Crowell <ro...@gmail.com> wrote:
> I suppose it would be possible to make multiple queries, using
> startkey and endkey to pull out the ranges.
>
> 1. Sort the "bad" tags: (BROKEN_IMAGE, OFFENSIVE_IMAGE)
> 2. For each bad tag, request documents:
>    i. Query 1:
>        startkey = []
>        endkey = ["BROKEN_IMAGE"]
>
>    ii. Query 2:
>        startkey = ["BROKEN_IMAGE", {}]
>        endkey = ["OFFENSIVE_IMAGE"]
>
>    iii. Query 3:
>        startkey = ["OFFENSIVE_IMAGE", {}]
>        endkey = [{}]
>
> Requires making N+1 queries, which for a fairly small list wouldn't be too bad.
>

foo AND bar
NOT baz
CONTAINS beer

Classic use cases for couchdb-lucene or Elasticsearch.

A+
Dave

Re: Building views to locate documents WITHOUT a certain set of tags

Posted by Jason Smith <jh...@iriscouch.com>.

On Thu, Dec 1, 2011 at 3:49 AM, Rob Crowell <ro...@gmail.com> wrote:
> I suppose it would be possible to make multiple queries, using
> startkey and endkey to pull out the ranges.
>
> 1. Sort the "bad" tags: (BROKEN_IMAGE, OFFENSIVE_IMAGE)
> 2. For each bad tag, request documents:
>    i. Query 1:
>        startkey = []
>        endkey = ["BROKEN_IMAGE"]
>
>    ii. Query 2:
>        startkey = ["BROKEN_IMAGE", {}]
>        endkey = ["OFFENSIVE_IMAGE"]
>
>    iii. Query 3:
>        startkey = ["OFFENSIVE_IMAGE", {}]
>        endkey = [{}]
>
> Requires making N+1 queries, which for a fairly small list wouldn't be too bad.

If you have a view of docs matching a condition, you can find docs
*not* matching that condition efficiently: make simultaneous queries
to _all_docs and your view. As long as the view rows you query share
the same key, both result sets will be sorted by doc id. Iterate
through both at the same time (no need to store them in memory),
spotting ids listed in _all_docs but not in your view.

I wrote this up here: http://stackoverflow.com/a/6210422/2938

Notes:

* If rows have identical keys, CouchDB sorts them by doc id. You can
emit any value for the rows; what's important here is row.id
* You can generalize the technique to perform multiple "NOT" queries
simultaneously.
* This is a situation where concurrent or event-driven languages like
JavaScript or Erlang shine.
* I'm pretty sure that in practice, the "NOT" queries add zero cost to
the query. It always takes the same time to complete: the time to
fetch _all_docs.
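
A minimal sketch of the merge step, under some assumptions: both row
lists are already sorted ascending by row.id, ids compare correctly
with plain string comparison (true for ASCII ids), and the rows are
given as arrays for brevity where a real client would stream them.
The function name idsNotInView is made up for this example.

```javascript
// allRows: rows from _all_docs.
// viewRows: rows from the view of docs matching the condition.
// Both are assumed sorted ascending by row.id.
function idsNotInView(allRows, viewRows) {
  var result = [];
  var j = 0;
  for (var i = 0; i < allRows.length; i++) {
    var id = allRows[i].id;
    // Advance past view rows that sort before the current id
    // (this also skips repeated rows for an earlier doc).
    while (j < viewRows.length && viewRows[j].id < id) {
      j++;
    }
    // Keep the id only if the view has no row for it.
    if (j >= viewRows.length || viewRows[j].id !== id) {
      result.push(id);
    }
  }
  return result;
}

var missing = idsNotInView(
  [{ id: "a" }, { id: "b" }, { id: "c" }, { id: "d" }],
  [{ id: "b" }, { id: "b" }, { id: "d" }]
);
console.log(missing);
```

Each list is walked once, so the work is bounded by the length of
_all_docs plus the length of the view result.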

I do not know if this technique has a name. If it doesn't, may I
propose: "The Thai Massage."


-- 
Iris Couch

Re: Building views to locate documents WITHOUT a certain set of tags

Posted by Rob Crowell <ro...@gmail.com>.

I suppose it would be possible to make multiple queries, using
startkey and endkey to pull out the ranges.

1. Sort the "bad" tags: (BROKEN_IMAGE, OFFENSIVE_IMAGE)
2. For each bad tag, request documents:
    i. Query 1:
        startkey = []
        endkey = ["BROKEN_IMAGE"]

    ii. Query 2:
        startkey = ["BROKEN_IMAGE", {}]
        endkey = ["OFFENSIVE_IMAGE"]

    iii. Query 3:
        startkey = ["OFFENSIVE_IMAGE", {}]
        endkey = [{}]

Requires making N+1 queries, which for a fairly small list wouldn't be too bad.
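
For illustration, the N+1 ranges could be generated like this. The
function name buildRanges is hypothetical, and the caller is assumed
to pass the bad tags already sorted in the view's collation order.
The sketch only reproduces the key ranges listed above; it says
nothing about whether the rows they return are the right ones.

```javascript
// sortedBadTags: bad tags, already sorted in the view's key order.
// Returns N+1 {startkey, endkey} pairs covering the gaps between them.
function buildRanges(sortedBadTags) {
  var ranges = [];
  var start = [];
  for (var i = 0; i < sortedBadTags.length; i++) {
    // Range from the end of the previous bad tag up to this one.
    ranges.push({ startkey: start, endkey: [sortedBadTags[i]] });
    // The next range starts just past every key under this tag.
    start = [sortedBadTags[i], {}];
  }
  // Final range: everything after the last bad tag.
  ranges.push({ startkey: start, endkey: [{}] });
  return ranges;
}

var ranges = buildRanges(["BROKEN_IMAGE", "OFFENSIVE_IMAGE"]);
ranges.forEach(function (r) {
  console.log(JSON.stringify(r));
});
```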
