Posted to user@couchdb.apache.org by Rob Crowell <ro...@gmail.com> on 2012/04/13 23:40:48 UTC

Question on CouchDB schema design with an ever-changing tags taxonomy

I have a design question!  Sorry it's a bit long, but basically the
scenario is: what to do when your documents are changing a lot?

As a hypothetical scenario, imagine we are writing an image cataloging
application that's backed by CouchDB.  Basically we're scraping the web,
extracting images, and assigning tags to them.  We want to allow users to
filter our image database by date as well as by website, so we need to
keep track of when and where we encounter an image each time we find it.
CouchDB's btrees are perfect for this!  We just need a few different
views we can reduce on to remove duplicates and get aggregate stats:
(tag, date, image-id), (publisher, date, image-id), and (date, image-id)

As a simple example, imagine we found the same photo of George Washington
50,000 times during the past year.  When we originally processed it, we
incorrectly classified this photo as "woman", so we have
50,000 documents in CouchDB that reference this photo of George with the
tag "woman."  Here's an example of one of our 50,000 scrape documents
containing a reference to George:

    {
        _id: "6f5902ac237024bdd0c176cb93063dc4",
        url: "http://www.example.org/great-photos.html",
        date: [2012, 4, 12, 3, 18, 9],
        images: [
            {
                _id: "bb329147df256345ef7acc34f944187f",
                filename: "http://www.example.com/photo-of-george.gif",
                position: {x: 300, y: 500},
                tags: ["woman"]
            },
            {
                _id: "de5b07c88ccb9ebaf5d5ccb3cd834c76",
                filename: "http://www.example.net/five-cats-in-a-bucket.gif
",
                position: {x: 0, y: 2000},
                tags: ["cat", "bucket"]
            }
        ]
    }
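
For concreteness, the (tag, date, image-id) view's map function could
look something like this against documents shaped like the one above
(the emitted value of 1 is just there so a _sum or _count reduce can
produce the aggregate stats; the other two views are analogous):

    function (doc) {
        if (doc.images && doc.date) {
            for (var i = 0; i < doc.images.length; i++) {
                var img = doc.images[i];
                for (var j = 0; j < img.tags.length; j++) {
                    // doc.date is the [year, month, day, ...] array, so
                    // keys sort chronologically within each tag and
                    // support range queries by date.
                    emit([img.tags[j], doc.date, img._id], 1);
                }
            }
        }
    }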

Some time later an admin discovers this photo and decides it should be
tagged "George Washington".  Furthermore, he knows this should be tagged as
"man" not "woman".

We kick off a big batch job to find all scrape documents that contain a
reference to this photo of George Washington, and update them in bulk 1,000
at a time.  While we're updating these documents, we haven't shut down
CouchDB, so things are in an inconsistent state; that should be OK as
long as we can get through the job fairly quickly.  Then our tagging
script crashes in the middle, which fortunately we caught and logged
(this time...), and now we need to go back and resume tagging from
where we left off, which also won't happen instantly.
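
To make the job concrete, each pass looks roughly like this (assuming a
hypothetical view images/by_image that emits each embedded image's _id
as its key; pagination uses the usual startkey/startkey_docid recipe,
and conflict retries are left out):

    const DB = "http://localhost:5984/scrapes";

    async function retag(imageId, newTags) {
        let lastDocId = null;
        for (;;) {
            // Page through the view 1,000 rows at a time.
            let qs = "startkey=" + encodeURIComponent(JSON.stringify(imageId)) +
                     "&endkey=" + encodeURIComponent(JSON.stringify(imageId)) +
                     "&include_docs=true&limit=1000";
            if (lastDocId) {
                // Resume after the last doc of the previous page.
                qs += "&startkey_docid=" + encodeURIComponent(lastDocId) + "&skip=1";
            }
            const res = await fetch(DB + "/_design/images/_view/by_image?" + qs);
            const body = await res.json();
            if (body.rows.length === 0) break;

            // Rewrite the tags on every matching image reference.
            const docs = body.rows.map(function (row) {
                row.doc.images.forEach(function (img) {
                    if (img._id === imageId) img.tags = newTags;
                });
                return row.doc;
            });

            // Write the whole page back in one request; _bulk_docs
            // reports conflicts per document, and those need retrying.
            await fetch(DB + "/_bulk_docs", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({ docs: docs })
            });

            lastDocId = body.rows[body.rows.length - 1].id;
        }
    }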

Now imagine that users are changing our tags all the time, and we have lots
of these requests running in parallel.  Our database grinds to a halt and
we can't even keep up with the updates anymore.  We've got lots of these
big documents that we're paging through basically in random order (the _id
of the scrape request has no locality with the images it contains), which
is keeping our disk nice and busy all day and night.

Instead of writing the tags into each scrape document, we could keep a
mapping of {image-id: [tag, tag, ...]} and {tag: [image-id, image-id,
...]}, but there may be thousands of images tagged "man", and
performing a map/reduce on each one separately to extract the site and
date could take several minutes even for a single tag when the disk is
as busy as ours perpetually is.
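
For instance, the {image-id: [tag, ...]} side could be a tiny document
per image, so a retag touches one doc instead of 50,000 (field names
here are just illustrative):

    {
        _id: "tags:bb329147df256345ef7acc34f944187f",
        type: "image-tags",
        tags: ["man", "George Washington"]
    }

The retag itself becomes a single-document update, but as described
above, the join back to publisher and date then moves to query time.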

Is there a good general approach to get a handle on this problem?