Posted to user@couchdb.apache.org by Peter Herndon <tp...@gmail.com> on 2008/11/24 18:24:49 UTC

Evaluating CouchDB

Hi all,

I'm in the process of looking at various technologies to implement a
"digital object repository".  The concept, and our current
implementation, come from http://www.fedora-commons.org/.  A digital
object repository is a store for managing objects, each made up of an
XML file that describes the object's structure and includes its
metadata, plus one or more binary files as datastreams.  As an
example, take an image object:  the FOXML (Fedora Object XML) file
details the location and kind of the datastreams, includes Dublin Core
and MODS metadata in namespaces, and includes some RDF/XML that
describes the object's relationship to other objects (e.g. isMemberOf
collection).  The datastreams for the image object include a
thumbnail-sized image, a screen-sized image (roughly 300 x 400), and
the original image in its full resolution.  Images are not the only
content type the software handles; pretty much anything can be
managed by the repository: PDFs, audio, video, XML, text, MS Office
documents, whatever you want.
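
To make that concrete, here's a rough sketch of what one image
object's metadata might look like if flattened into a single JSON
document.  The field names are mine and purely illustrative, not
FOXML:

    {
      "id": "demo:img123",
      "type": "image",
      "dc": {
        "title": "Example image",
        "creator": "Jane Doe"
      },
      "rels": { "isMemberOf": "demo:collection1" },
      "datastreams": {
        "THUMBNAIL": { "url": "http://files.example.org/img123-thumb.jpg" },
        "SCREEN":    { "url": "http://files.example.org/img123-screen.jpg" },
        "FULL":      { "url": "http://files.example.org/img123-full.tiff" }
      }
    }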

The repository software provides access control, and provides APIs
(both SOAP and, to a limited extent, REST) to manage objects, their
metadata, and their binary datastreams. The XML is stored locally on
the file system, and the datastreams can be either stored locally, or
referenced by HTTP.  The problem with the software is that it's got a
great architectural vision, but the implementation is of variable
quality.  There are lots of little pieces, and many of them are not
written with best practices in mind, or have had zero exposure to
real-world environments, and the code reflects that.
Plus, my days of slinging Java and enjoying it are long since past.

Our current implementation consists of a Java front end, plus the
repository on the back end.  We have approximately 40GB of images
stored in the repository at the moment, from our pilot project.  We
have four other departments wanting to use the software, either in a
group repository or in a dedicated repository of their own.  The most
intimidating project is one that currently has 20+ TB of images and
anticipates creating and ingesting 240+ GB more per day when in full
swing.  We don't really expect to ingest that much data directly into
the repository, as our network would be a major bottleneck -- the lab
that creates the data is physically located a good distance away from
our data center, and those images are already being transferred once
to a file share at the data center.  If we continue with our current
back-end, we'll likely stick a web server in front of the file share,
and use the HTTP reference, rather than transferring them again to the
repository's storage.

Anyway, that's my current use case, and my next use case.  I know that
CouchDB isn't finished yet, and hasn't been optimized yet, but does
anyone have any opinions on whether CouchDB would be a reasonable fit
for managing the metadata associated with each object?  And, likewise,
would CouchDB be a reasonable fit for managing the binary datastreams?
Would it be practical to store the datastreams in CouchDB itself, and
up to what size limit/throughput limit?  Would it be better to store
the datastreams externally and use CouchDB to manage the metadata and
access control?  Also, looking down the road, are there plans for
CouchDB's development that would improve its fitness for this purpose
in the future?

Thanks very much for any insight you can share,

---Peter Herndon

Re: Evaluating CouchDB

Posted by Peter Herndon <tp...@gmail.com>.
Thanks very much, Chris, I greatly appreciate your insight.  I'll keep
you informed on how things work out.

---Peter

Re: Evaluating CouchDB

Posted by Chris Anderson <jc...@apache.org>.
On Mon, Nov 24, 2008 at 9:24 AM, Peter Herndon <tp...@gmail.com> wrote:

>
> Anyway, that's my current use case, and my next use case.  I know that
> CouchDB isn't finished yet, and hasn't been optimized yet, but does
> anyone have any opinions on whether CouchDB would be a reasonable fit
> for managing the metadata associated with each object?

I think CouchDB is pretty much designed with this use case in mind. If
you were lucky enough to convince the organization to switch from XML
to JSON, the software would pretty much write itself. And CouchDB does
a fairly decent job of dealing in XML as well (using SpiderMonkey's
E4X engine), so that's not even required.
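
For a sense of the API: creating or updating a metadata document is a
single HTTP PUT. Here's a minimal sketch in Python, assuming a local
CouchDB on the default port 5984 and a database named "repository"
(both assumptions, nothing that exists on your side today):

    import json
    import urllib.request

    couch = "http://127.0.0.1:5984"  # assumed local CouchDB instance

    # One object's metadata as a JSON document (illustrative fields).
    doc = {
        "type": "image",
        "dc": {"title": "Example image", "creator": "Jane Doe"},
        "datastreams": {
            "THUMBNAIL": {"url": "http://files.example.org/img123-thumb.jpg"},
        },
    }

    # PUT /dbname/docid creates the document (or a new revision of it).
    # The database itself is created once, with an empty PUT to /repository.
    req = urllib.request.Request(
        couch + "/repository/img123",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read())  # e.g. {"ok":true,"id":"img123","rev":"1-..."}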

> And, likewise,
> would CouchDB be a reasonable fit for managing the binary datastreams?
>  Would it be practical to store the datastreams in CouchDB itself, and
> up to what size limit/throughput limit?

CouchDB's attachment support is pretty much designed for this use case
(attachments can be multi-GB files, and aren't sent to view servers).
From your description, it sounds like you are maxing out IO at the
network level, so it's hard to say how CouchDB would interact with
such a stream, without seeing it in action. However, CouchDB's
replication and distribution capabilities should make managing
multi-site projects as simple as one can hope for. If you shard
projects as databases, then you can use replication to make them
available on the local network for the various sites, which should
make it easier to avoid load bottlenecks at a central repository.
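
To make the moving parts concrete, here's a sketch under the same
assumptions as above (local CouchDB, a "repository" database, plus a
hypothetical site-local node) that attaches a binary to an existing
document and then pushes the database out:

    import json
    import urllib.request

    couch = "http://127.0.0.1:5984"  # assumed local CouchDB instance

    # 1. Fetch the document's current revision; an attachment update
    #    must name the revision it applies to.
    with urllib.request.urlopen(couch + "/repository/img123") as resp:
        rev = json.loads(resp.read())["_rev"]

    # 2. PUT the binary as a standalone attachment.  The body is raw
    #    bytes, not JSON, and attachments are never sent to the view
    #    server.  (Read into memory to keep the sketch short; for
    #    multi-GB files you'd stream the upload instead.)
    with open("img123-full.tiff", "rb") as f:
        req = urllib.request.Request(
            couch + "/repository/img123/full.tiff?rev=" + rev,
            data=f.read(),
            headers={"Content-Type": "image/tiff"},
            method="PUT",
        )
        urllib.request.urlopen(req)

    # 3. One-shot replication to a (hypothetical) site-local node.
    body = {"source": "repository",
            "target": "http://couch.lab.example.org:5984/repository"}
    req = urllib.request.Request(
        couch + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)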

> Would it be better to store
> the datastreams externally and use CouchDB to manage the metadata and
> access control?

It's not clear - obviously importing TBs of data from a filesystem to
CouchDB will take time and expense, even if CouchDB handles it
swimmingly. The nice thing about the schemaless documents is that you
can be flexible going forward, maybe referencing some assets via URIs
and storing others as attachments.

> Also, looking down the road, are there plans for
> CouchDB's development that would improve its fitness for this purpose
> in the future?

Your project sounds like a good fit for CouchDB. Of course, you are
talking about working on the high end of the performance / scalability
curve, and CouchDB is relatively new, so you'll have to be comfortable
as a trail-blazer (not that you'd be the only one, but with a new
technology, you'll be in a smaller crowd than if you used something
that's been around longer).

I think the biggest positive reason to use CouchDB for your project is
the ease of federation / distribution / offline work. Once you've
built the business rules and document format around your project and
CouchDB, booting up other instances of the project for more media
collections should be straightforward. Because the documents will be
more self-contained than what you'd have with a SQL store, for
instance, you could build something amenable to merging multiple
repositories, or splitting off just a portion of a repository for a
particular purpose. This flexibility seems like a big win, as it would
allow you to respond to things like datacenter-level bottlenecks with
changes that users will understand, such as moving just the necessary
sub-collections to a more local server.

Good luck and keep us up to date with your progress.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com