Posted to user@couchdb.apache.org by Barry Wark <ba...@gmail.com> on 2009/02/10 20:33:59 UTC

A permanent view for user-entered query with complex boolean expressions?

Hi all,

I'm in the planning stage for a frontend to a large data set of
physiology data. I'm new to CouchDB and would like to get some
feedback on the feasibility of some ideas before I dig too far into
implementation.

The data:
Conceptually, the important parts of the data set can be modeled as a
set of trials. Each trial has one or more stimulus settings which are
key-value pairs. Not all trials have the same set of settings and not
all trials with the same setting have the same value for that setting.
CouchDB documents appear well-suited for this form of data. In
addition, each trial has one or more numeric datasets, each order 1MB,
but up to 100MB. It seems that having CouchDB documents that contain a
key-value pair like

"parameters" : {
    "parameter1" : value1,
    "parameter2" : value2,
    //etc.
}

and with attachments for the numeric data sets is the CouchDB way to go.

Users will want to query this data set for all trials whose settings
satisfy some boolean expression. So, for example "trials where
(parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"

So, now a few questions:

1. Is there a way to create a permanent view that supports queries
like that above? I got as far as a view like

map:
function map(doc) {
    // Emit one row per stimulus setting, keyed by [name, value].
    for (var parameter in doc.parameters) {
        emit([parameter, doc.parameters[parameter]], doc._id);
    }
}

reduce:
function reduce(keys, values, rereduce) {
    if(rereduce) {
        return union(values)
    }

    return values
}

I believe this will give a view which, when queried with group=true,
will give a set of rows keyed by [parameter, parameterValue], each
with a list of the trial document IDs that have that
parameter:parameterValue. Is this correct?
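
To make the question concrete, here is a minimal sketch that runs the map
function outside CouchDB against two made-up documents and groups the emitted
doc ids by key. The emit() stand-in, the sample ids and the grouping loop are
assumptions for illustration; they are not CouchDB's own implementation.

```javascript
// Hypothetical sample documents following the "parameters" layout above.
var docs = [
    { _id: "trial-1", parameters: { parameter1: 10, parameter2: 42 } },
    { _id: "trial-2", parameters: { parameter1: 10, parameter2: 7 } }
];

// Stand-in for CouchDB's emit(): collects rows in a plain array.
var rows = [];
var currentId = null;
function emit(key, value) {
    rows.push({ id: currentId, key: key, value: value });
}

// The map function from the message above, with valid for-in syntax.
function map(doc) {
    for (var parameter in doc.parameters) {
        emit([parameter, doc.parameters[parameter]], doc._id);
    }
}

docs.forEach(function (doc) {
    currentId = doc._id;
    map(doc);
});

// Group the emitted doc ids by key, which is the shape the question asks about.
var grouped = {};
rows.forEach(function (row) {
    var k = JSON.stringify(row.key);
    (grouped[k] = grouped[k] || []).push(row.value);
});
```

Here grouped['["parameter1",10]'] holds both trial ids, while each
parameter2 key holds a single id.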

Given this, I could do a union of the values of rows with
startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
the set of trial document ids that match the query.
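
As a rough sketch of how each term of the example expression might map onto
view query parameters, assuming a map-only view keyed by [parameter, value]:
an equality term becomes a key lookup and a ">=" term becomes a startkey/endkey
range. The {name, op, value} term shape and both helper names are invented for
this example. Note that for an AND expression the per-term id sets would then
be intersected rather than unioned.

```javascript
// Hypothetical helper: turn one term of the boolean expression into view
// query options. The term shape {name, op, value} is made up for this sketch.
function termToViewQuery(term) {
    if (term.op === "==") {
        return { key: [term.name, term.value] };
    }
    if (term.op === ">=") {
        // In CouchDB view collation, objects sort after numbers and strings,
        // so {} serves as an upper bound for the second key element.
        return { startkey: [term.name, term.value], endkey: [term.name, {}] };
    }
    throw new Error("unsupported operator: " + term.op);
}

// View parameters are sent JSON-encoded in the query string.
function toQueryString(options) {
    return Object.keys(options).map(function (name) {
        return name + "=" + encodeURIComponent(JSON.stringify(options[name]));
    }).join("&");
}

var q1 = toQueryString(termToViewQuery({ name: "parameter1", op: "==", value: 10 }));
var q2 = toQueryString(termToViewQuery({ name: "parameter2", op: ">=", value: 42 }));
```

This still costs one GET per term; it only shows what each request would
look like.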

But is there a way to structure the view's map/reduce so that I don't
have to do the union in my code (i.e. CouchDB does it as part of the
map/reduce)? The approach outlined above leads to an HTTP GET for each
term in the boolean expression, for example.

2. What is the (practical) limit on attachment size? Is it reasonable
to store multi-MB attachments in the database? If not, I will go with
an external file(s) for the numeric data and storing a reference in
the trial document.

Thanks for any insight,

Barry

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Paul Davis <pa...@gmail.com>.

Barry,

On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark <ba...@gmail.com> wrote:
> Hi all,
>
> I'm in the planning stage for a frontend to a large data set of
> physiology data. I'm new to CouchDB and would like to get some
> feedback on the feasibility of some ideas before I dig too far into
> implementation.
>
> The data:
> Conceptually, the important parts of the data set can be modeled as a
> set of trials. Each trial has one or more stimulus settings which are
> key-value pairs. Not all trials have the same set of settings and not
> all trials with the same setting have the same value for that setting.
> CouchDB documents appear well-suited for this form of data. In
> addition, each trial has one or more numeric datasets, each order 1MB,
> but up to 100MB. It seems that having CouchDB documents that contain a
> key-value pair like
>
> "parameters" : {
>    "parameter1" : value1,
>    "parameter2" : value2,
>    //etc.
> }
>
> and with attachments for the numeric data sets is the CouchDB way to go.
>

This is exactly the layout I'd recommend using.

> Users will want to query this data set for all trials whose settings
> satisfy some boolean expression. So, for example "trials where
> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>
> So, now a few questions:
>
> 1. Is there a way to create a permanent view that supports queries
> like that above? I got as far as a view like
>
> map:
> function map(doc) {
>    for (var parameter in doc.parameters) {
>        emit([parameter, doc.parameters[parameter]], doc._id)
>    }
> }
>
> reduce:
> function reduce(keys, values, rereduce) {
>    if(rereduce) {
>        return union(values)
>    }
>
>    return values
> }
>
> I believe this will give a view which, when queried with group=true,
> will give a set of rows keyed by [parameter, parameterValue] and
> with a list of trial document IDs that have that
> parameter:parameterValue. Is this correct?
>
> Given this, I could do a union of the values of rows with
> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
> the set of trial document ids that match the query.
>
> But is there a way to structure the view's map/reduce so that I don't
> have to do the union in my code (i.e. CouchDB does it as part of the
> map/reduce)? The approach outlined above leads to an HTTP GET for each
> term in the boolean expression, for example.
>

Unfortunately, this is one of the aspects of CouchDB that is hard to
overcome. Lots of user-specifiable queries can lead to complications
without some limitation. Hopefully by the time 1.0 rolls through we'll
have made much more progress in dynamic query capabilities, but until
then the method I'd recommend would be something along the lines of
this:

The first step is to know how many doc ids you have for each
parameter. Here we'll set that up:

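// Map: emit one row per top-level document property, skipping CouchDB's
// internal fields such as _id and _rev.
function(doc)
{
    for(var prop in doc) if(prop.substr(0,1) != "_") emit(prop, 1);
}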

// Reduce
function(keys, values)
{
    return sum(values);
}

Now you can query this with multi-get so that you know the number of
docids for each input parameter in your query by posting a JSON body
to the view:

curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
    'http://127.0.0.1:5984/db_name/_view/vname?group=true'

Now that we know the relative number of docids we can start searching
for the result set by applying each boolean clause using set math. We
just apply from the smallest number of docids to the largest to try
and make sure we keep resource usage to a minimum.
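
A minimal client-side sketch of that idea, assuming the counts come back as
group=true rows and that the doc ids for each clause have already been fetched
into arrays. The sample response, the ids and both function names are
hypothetical.

```javascript
// Hypothetical group=true response from the count view above.
var countResponse = { rows: [
    { key: "param1", value: 1200 },
    { key: "param2", value: 15 },
    { key: "paramN", value: 300 }
] };

// Order parameter names from fewest to most matching documents.
function orderBySelectivity(response) {
    return response.rows.slice().sort(function (a, b) {
        return a.value - b.value;
    }).map(function (row) { return row.key; });
}

// Intersect arrays of doc ids in the order given (AND semantics). Starting
// with the smallest list keeps every intermediate result small.
function intersectAll(idLists) {
    if (idLists.length === 0) return [];
    return idLists.slice(1).reduce(function (acc, ids) {
        var present = Object.create(null);
        ids.forEach(function (id) { present[id] = true; });
        return acc.filter(function (id) { return present[id] === true; });
    }, idLists[0].slice());
}

var order = orderBySelectivity(countResponse);
// order is ["param2", "paramN", "param1"]
var matches = intersectAll([["a", "c"], ["a", "b", "c"], ["c", "d"]]);
// matches is ["c"]
```

An OR clause would use a union at the same step; the ordering trick mostly
helps the AND case.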

At the moment, that's the pure CouchDB way. In real life for your
query interface I'd most likely write a small slave process that uses
the _external interface. Hopefully in the next months a couple feature
ideas I have rattling around will coalesce into an implementation that
will make things like this easier from directly within CouchDB. But
for right now, that's all hand waving.

> 2. What is the (practical) limit on attachment size? Is it reasonable
> to store multi-MB attachments in the database? If not, I will go with
> an external file(s) for the numeric data and storing a reference in
> the trial document.
>
> Thanks for any insight,
>
> Barry
>

Trunk has support for streaming writes when a Content-Length header is
present. Chris Anderson was just working the other day on streaming
writes to disk in the absence of a Content-Length header. That
basically means that if your HTTP client sends a content-length
header, the sky's the limit. If you don't send a Content-Length
header, you'll be limited by the available RAM on the machine running
CouchDB until Chris finishes his patch.

A small caveat for the current implementation is that larger
attachments can end up causing a bit of RAM usage on the receiving
end. I would doubt that 100MiB attachments are big enough to cause an
issue, but you may want to test that before relying on it. Hopefully
this is taken care of pre-0.9 (the bits and pieces appear to be
falling into place at least).

HTH,
Paul Davis

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Brian Candler <B....@pobox.com>.

On Tue, Feb 10, 2009 at 02:16:08PM -0800, Barry Wark wrote:
> >> map:
> >> function map(doc) {
> >>    for (var parameter in doc.parameters) {
> >>        emit([parameter, doc.parameters[parameter]], doc._id)
> >>    }
> >> }
> >>
> >> reduce:
> >> function reduce(keys, values, rereduce) {
> >>    if(rereduce) {
> >>        return union(values)
> >>    }
> >>
> >>    return values
> >> }
> 
> In fact, I think I messed up; I don't really need the reduce function
> in this view do I?

And I don't believe you need to emit the doc._id as a value either; null
will do. (Couch remembers each source doc_id anyway, and provides them in
the view)
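
A small sketch of what that looks like from the client side, assuming a
map-only view whose map function calls emit([parameter, value], null). The
response object below is a made-up example of the row shape; the "id" field on
each row is supplied by CouchDB, not by the map function.

```javascript
// Hypothetical response from a map-only view that emits null values.
var response = {
    total_rows: 3,
    offset: 0,
    rows: [
        { id: "trial-1", key: ["parameter1", 10], value: null },
        { id: "trial-2", key: ["parameter1", 10], value: null }
    ]
};

// Collect the source document ids straight from the rows.
function docIds(viewResponse) {
    return viewResponse.rows.map(function (row) { return row.id; });
}

var ids = docIds(response);
// ids is ["trial-1", "trial-2"]
```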

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Paul Davis <pa...@gmail.com>.

On Tue, Feb 10, 2009 at 5:16 PM, Barry Wark <ba...@gmail.com> wrote:
> Paul,
>
> Thanks for the very interesting response. CouchDB is looking like a
> huge win for us in the long run. A couple of quick follow ups inline
> below...
>

[snip]

>> Now that we know the relative number of docids we can start searching
>> for the result set by applying each boolean clause using set math. We
>> just apply from the smallest number of docids to the largest to try
>> and make sure we keep resource usage to a minimum.
>
> This seems like a very common pattern. Is there any chance of getting
> it implemented in CouchDB?
>

There's definitely a chance. I have no idea how soon, or exactly what
the feature would end up looking like, but I've found the CouchDB
community very accepting of patches for new features. I'm pretty
interested in such a feature myself, and I'll shortly post a summary
email of my thoughts to discuss what people would find most beneficial
and to find any flaws in my logic.

>>
>> At the moment, that's the pure CouchDB way. In real life for your
>> query interface I'd most likely write a small slave process that uses
>> the _external interface. Hopefully in the next months a couple feature
>> ideas I have rattling around will coalesce into an implementation that
>> will make things like this easier from directly within CouchDB. But
>> for right now, that's all hand waving.
>
> I'm not familiar with the _external interface yet. Is there some
> documentation? Is this how the lucene index that Robert mentions
> works?
>
> User-specifiable queries like this are going to be a critical feature
> for us, whether we go with CouchDB or not, so I'm very interested in
> keeping up with related developments. Feel free to contact me offline
> if you're interested in more specific use cases etc.
>
> Thanks again,
> Barry
>

The current _external interface documentation can be found at [1]. It's
basically just a thin wrapper around the HTTP request and response to
allow for custom code to be run behind CouchDB.

The easiest way to keep up is by following the dev list. I'd
definitely recommend it if you're interested in testing features as
they land.

[1] http://wiki.apache.org/couchdb/ExternalProcesses

HTH,
Paul Davis

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Robert Newson <ro...@gmail.com>.

couchdb-lucene uses [externals] to receive queries from the client and
it currently polls all_docs_by_seq for updates. This seems to match
Lucene's batch-oriented model anyway, so I've not looked deeply into
the update_notification option, etc.

B.


Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Barry Wark <ba...@gmail.com>.

Paul,

Thanks for the very interesting response. CouchDB is looking like a
huge win for us in the long run. A couple of quick follow ups inline
below...

On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis
<pa...@gmail.com> wrote:
> Barry,
>
> On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark <ba...@gmail.com> wrote:
>> Hi all,
>>
>> I'm in the planning stage for a frontend to a large data set of
>> physiology data. I'm new to CouchDB and would like to get some
>> feedback on the feasibility of some ideas before I dig too far into
>> implementation.
>>
>> The data:
>> Conceptually, the important parts of the data set can be modeled as a
>> set of trials. Each trial has one or more stimulus settings which are
>> key-value pairs. Not all trials have the same set of settings and not
>> all trials with the same setting have the same value for that setting.
>> CouchDB documents appear well-suited for this form of data. In
>> addition, each trial has one or more numeric datasets, each order 1MB,
>> but up to 100MB. It seems that having CouchDB documents that contain a
>> key-value pair like
>>
>> "parameters" : {
>>    "parameter1" : value1,
>>    "parameter2" : value2,
>>    //etc.
>> }
>>
>> and with attachments for the numeric data sets is the CouchDB way to go.
>>
>
> This is exactly the layout I'd recommend using.
>
>> Users will want to query this data set for all trials whose settings
>> satisfy some boolean expression. So, for example "trials where
>> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>>
>> So, now a few questions:
>>
>> 1. Is there a way to create a permanent view that supports queries
>> like that above? I got as far as a view like
>>
>> map:
>> function map(doc) {
>>    for (var parameter in doc.parameters) {
>>        emit([parameter, doc.parameters[parameter]], doc._id)
>>    }
>> }
>>
>> reduce:
>> function reduce(keys, values, rereduce) {
>>    if(rereduce) {
>>        return union(values)
>>    }
>>
>>    return values
>> }

In fact, I think I messed up; I don't really need the reduce function
in this view do I?
>>
>> I believe this will give a view which, when queried with group=true,
>> will give a set of rows keyed by [parameter, parameterValue] and
>> with a list of trial document IDs that have that
>> parameter:parameterValue. Is this correct?
>>
>> Given this, I could do a union of the values of rows with
>> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
>> the set of trial document ids that match the query.
>>
>> But is there a way to structure the view's map/reduce so that I don't
>> have to do the union in my code (i.e. CouchDB does it as part of the
>> map/reduce)? The approach outlined above leads to an HTTP GET for each
>> term in the boolean expression, for example.
>>
>
> Unfortunately, this is one of the aspects of CouchDB that is hard to
> overcome. Lots of user-specifiable queries can lead to complications
> without some limitation. Hopefully by the time 1.0 rolls through we'll
> have made much more progress in dynamic query capabilities, but until
> then the method I'd recommend would be something along the lines of
> this:
>
> The first step is to know how many doc ids you have for each
> parameter. Here we'll set that up:
>
> // Map
> function(doc)
> {
>    for(var prop in doc) if(prop.substr(0,1) != "_") emit(prop, 1);
> }
>
> // Reduce
> function(keys, values)
> {
>    return sum(values);
> }
>
> Now you can query this with multi-get so that you know the number of
> docids for each input parameter in your query by posting a JSON body
> to the view:
>
> curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
>     'http://127.0.0.1:5984/db_name/_view/vname?group=true'
>
> Now that we know the relative number of docids we can start searching
> for the result set by applying each boolean clause using set math. We
> just apply from the smallest number of docids to the largest to try
> and make sure we keep resource usage to a minimum.

This seems like a very common pattern. Is there any chance of getting
it implemented in CouchDB?

>
> At the moment, that's the pure CouchDB way. In real life for your
> query interface I'd most likely write a small slave process that uses
> the _external interface. Hopefully in the next months a couple feature
> ideas I have rattling around will coalesce into an implementation that
> will make things like this easier from directly within CouchDB. But
> for right now, that's all hand waving.

I'm not familiar with the _external interface yet. Is there some
documentation? Is this how the lucene index that Robert mentions
works?

User-specifiable queries like this are going to be a critical feature
for us, whether we go with CouchDB or not, so I'm very interested in
keeping up with related developments. Feel free to contact me offline
if you're interested in more specific use cases etc.

Thanks again,
Barry

>
>> 2. What is the (practical) limit on attachment size? Is it reasonable
>> to store multi-MB attachments in the database? If not, I will go with
>> an external file(s) for the numeric data and storing a reference in
>> the trial document.
>>
>> Thanks for any insight,
>>
>> Barry
>>
>
> Trunk has support for streaming writes when a Content-Length header is
> present. Chris Anderson was just working the other day on streaming
> writes to disk in the absence of a Content-Length header. That
> basically means that if your HTTP client sends a content-length
> header, the sky's the limit. If you don't send a Content-Length
> header, you'll be limited by the available RAM on the machine running
> CouchDB until Chris finishes his patch.
>
> A small caveat for the current implementation is that larger
> attachments can end up causing a bit of RAM usage on the receiving
> end. I would doubt that 100MiB attachments are big enough to cause an
> issue, but you may want to test that before relying on it. Hopefully
> this is taken care of pre-0.9 (the bits and pieces appear to be
> falling into place at least).
>
> HTH,
> Paul Davis
>

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Robert Newson <ro...@gmail.com>.

>>So, for example "trials where
>>(parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"

Whether this can be accomplished inside CouchDB or not, it should be
possible to do inside a Lucene index based off CouchDB data. I don't
currently do the right thing for numbers in range queries, but when I
fix that, this kind of query should return _id,_rev pairs:

../dbname/_fti?q=parameter1:10 AND parameter2:[0 TO 42]

The current issue (for my code) is that 0 to 42 is evaluated
lexicographically, which is wrong. I have a planned fix but no time
this week to commit it. It'll appear at
http://github.com/rnewson/couchdb-lucene when it does.

I expect the query to evaluate very quickly, even when you change the
start and end points (in fact my approach expects you to re-query with
different points).
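
To illustrate why lexicographic evaluation gives the wrong answer for numeric
ranges, here is a small sketch in plain JavaScript. It also shows zero-padding,
which is one common workaround for non-negative integers; it is only an
assumption for illustration and not necessarily the planned fix. The function
names and the padding width are made up.

```javascript
// Compare values as strings, the way a lexicographic range check would.
function inRangeLexicographic(value, low, high) {
    return value >= low && value <= high;
}

// "5" sorts after "42" because "5" > "4", so 5 falls outside "0".."42",
// while "100" falls inside it because "1" < "4".
var fiveIncluded = inRangeLexicographic("5", "0", "42");       // false (wrong)
var hundredIncluded = inRangeLexicographic("100", "0", "42");  // true (wrong)

// Left-pad a non-negative integer with zeros so that string order matches
// numeric order. The width must cover the largest expected value.
function padNumber(n, width) {
    var s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
}

var low = padNumber(0, 6);
var high = padNumber(42, 6);
var fiveIncludedPadded = inRangeLexicographic(padNumber(5, 6), low, high);      // true
var hundredIncludedPadded = inRangeLexicographic(padNumber(100, 6), low, high); // false
```

Negative numbers and decimals would need a different encoding, so treat this
as a demonstration of the problem rather than a complete solution.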

B.

On Tue, Feb 10, 2009 at 3:34 PM, Chris Anderson <jc...@apache.org> wrote:
> On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis
> <pa...@gmail.com> wrote:
>>
>> Trunk has support for streaming writes when a Content-Length header is
>> present. Chris Anderson was just working the other day on streaming
>> writes to disk in the absence of a Content-Length header. That
>> basically means that if your HTTP client sends a content-length
>> header, the sky's the limit. If you don't send a Content-Length
>> header, you'll be limited by the available RAM on the machine running
>> CouchDB until Chris finishes his patch.
>
> Just to clear up, currently attachment PUTs without Content-Length
> headers are rejected. I think that we fixed the RAM buffering issue
> after all:
>
> https://issues.apache.org/jira/browse/COUCHDB-189 (fixed)
>
> So if you know the length of the attachment, PUT should work for you
> no matter how big it is.
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Jan Lehnardt <ja...@apache.org>.

On 10 Feb 2009, at 23:10, Barry Wark wrote:

> On Tue, Feb 10, 2009 at 12:34 PM, Chris Anderson <jc...@apache.org> wrote:
>> On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis
>> <pa...@gmail.com> wrote:
>>>
>>> Trunk has support for streaming writes when a Content-Length header is
>>> present. Chris Anderson was just working the other day on streaming
>>> writes to disk in the absence of a Content-Length header. That
>>> basically means that if your HTTP client sends a content-length
>>> header, the sky's the limit. If you don't send a Content-Length
>>> header, you'll be limited by the available RAM on the machine running
>>> CouchDB until Chris finishes his patch.
>>
>> Just to clear up, currently attachment PUTs without Content-Length
>> headers are rejected. I think that we fixed the RAM buffering issue
>> after all:
>>
>> https://issues.apache.org/jira/browse/COUCHDB-189 (fixed)
>>
>> So if you know the length of the attachment, PUT should work for you
>> no matter how big it is.
>
> Very cool. What about reading the attachment? Is there a significant
> performance hit for streaming the attachment out of the database as
> opposed to reading the data directly out of a separate file?

We don't use sendfile() yet, so it is not optimal, but overhead is minimal.

Cheers
Jan
--

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Barry Wark <ba...@gmail.com>.

On Tue, Feb 10, 2009 at 12:34 PM, Chris Anderson <jc...@apache.org> wrote:
> On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis
> <pa...@gmail.com> wrote:
>>
>> Trunk has support for streaming writes when a Content-Length header is
>> present. Chris Anderson was just working the other day on streaming
>> writes to disk in the absence of a Content-Length header. That
>> basically means that if your HTTP client sends a content-length
>> header, the sky's the limit. If you don't send a Content-Length
>> header, you'll be limited by the available RAM on the machine running
>> CouchDB until Chris finishes his patch.
>
> Just to clear up, currently attachment PUTs without Content-Length
> headers are rejected. I think that we fixed the RAM buffering issue
> after all:
>
> https://issues.apache.org/jira/browse/COUCHDB-189 (fixed)
>
> So if you know the length of the attachment, PUT should work for you
> no matter how big it is.

Very cool. What about reading the attachment? Is there a significant
performance hit for streaming the attachment out of the database as
opposed to reading the data directly out of a separate file?

thanks,
Barry

>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: A permanent view for user-entered query with complex boolean expressions?

Posted by Chris Anderson <jc...@apache.org>.

On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis
<pa...@gmail.com> wrote:
>
> Trunk has support for streaming writes when a Content-Length header is
> present. Chris Anderson was just working the other day on streaming
> writes to disk in the absence of a Content-Length header. That
> basically means that if your HTTP client sends a content-length
> header, the sky's the limit. If you don't send a Content-Length
> header, you'll be limited by the available RAM on the machine running
> CouchDB until Chris finishes his patch.

Just to clear up, currently attachment PUTs without Content-Length
headers are rejected. I think that we fixed the RAM buffering issue
after all:

https://issues.apache.org/jira/browse/COUCHDB-189 (fixed)

So if you know the length of the attachment, PUT should work for you
no matter how big it is.
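
As a rough sketch of what sending the length up front could look like from a
Node.js client: the helper below only builds the request options, with the
Content-Length computed from the body. The helper name, host, port and the
document and attachment names are assumptions for illustration, and the
path shape assumes the standalone attachment URL /db/docid/attachment?rev=...

```javascript
// Hypothetical helper: build Node.js http.request options for an attachment
// PUT that carries an explicit Content-Length header.
function attachmentPutOptions(dbName, docId, rev, attachmentName, body, contentType) {
    return {
        method: "PUT",
        host: "127.0.0.1", // assumed local CouchDB
        port: 5984,
        path: "/" + encodeURIComponent(dbName) +
              "/" + encodeURIComponent(docId) +
              "/" + encodeURIComponent(attachmentName) +
              "?rev=" + encodeURIComponent(rev),
        headers: {
            "Content-Type": contentType,
            // Length in bytes, not characters; works for strings and Buffers.
            "Content-Length": Buffer.byteLength(body)
        }
    };
}

var options = attachmentPutOptions(
    "db_name", "trial-1", "1-abc", "dataset1.bin", "hello", "application/octet-stream");

// To actually send it (not executed here):
//   var req = require("http").request(options, function (res) { /* ... */ });
//   req.end("hello");
```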

-- 
Chris Anderson
http://jchris.mfdz.com
