Posted to user@couchdb.apache.org by Torstein Krause Johansen <to...@gmail.com> on 2011/06/06 14:12:22 UTC

Re: Complex queries & results

Hi Benjamin,

and thanks for your comments.

On 31/05/11 22:11, Benjamin Young wrote:
> On 5/27/11 5:16 AM, Torstein Krause Johansen wrote:

>>> ?group=true&group_level=2&startkey=["2011-05-26"]&endkey=["2011-05-27",
>>> {}]
>>>
>>> results in:
>>>
>>> {
>>> "key": ["2011-05-26", "Lisa"],
>>> "value": 1
>>> },
>>> {
>>> "key": ["2011-05-26", "John"],
>>> "value": 2
>>> },
>>> {
>>> "key": ["2011-05-27", "John"],
>>> "value": 1
>>> }
>>>
>>> You can of course emit not just days, but also weeks, months,
>>> quarters if that's what you always want. If it's arbitrary and you need
>>> to aggregate the names afterwards from this smaller set, you should do
>>> it in the client (whoever calls CouchDB to get this information out).
>>
>> Mhmm, ok, thanks for explaining this.
>>
>> It means though, that for every unique time stamp that a_name has an
>> entry, there will be a corresponding count returned (like the three
>> you listed above).
>>
>> Hence, if a_name has 1000 entries at slightly different times within
>> the time range I'm searching for (my created_at includes seconds), I
>> will get 1000 such entries back.
>
> It really just depends on what you want to count/reduce/etc. If you only
> want a count of the names (and don't want additional
> granularity--name+year counts) then just return the name as the index.
> If you want the count of names by year/month/day, etc, then return those
> *after* the name, so you can add specificity by incrementing your
> group_level param.
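
If I read you right, that would mean a map along these lines (my 
interpretation of your suggestion, using my created_at format):

function (doc) {
  var d = doc.created_at; // "YYYY-MM-DD hh:mm:ss"
  emit([doc.a_name, d.slice(0, 4), d.slice(5, 7), d.slice(8, 10)], 1);
}

with _count as the reduce, so that group_level=1 counts per name, 
group_level=2 per name and year, and so on.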

There's probably something I haven't understood here, though. If I add 
my search fields after a_name, how can I limit my search with startkey 
and endkey when a_name cannot be included in them (since the name is 
what I want to count on)?

Just to be sure, I want to re-state what I want: I have documents with 
the following fields:

{
     one_id : 1,
     another_id : 22,
     created_at : "2011-05-26 23:22:11",
     a_name : "Lisa"
}

I want to be able to search all occurrences with a combination of the 
first three fields as query parameters and then count the number of 
a_name occurrences within each of these search collections.

There will be many entries like the one above (say 30,000), where the 
only difference is the created_at field. Searching with these variable 
parameters:

     one_id=1,
     another_id=22,
     created_at > "2011-05-26 23:30:00"
     created_at < "2011-05-27 01:00:00"

I want to end up with a dictionary listing the names and their count 
matching the search parameters:

{
   "Lisa" : 132,
   "John" : 16
}

If I put [created_at, one_id, another_id, a_name] in the key, I can use 
the start and end keys:
?group=true&
group_level=4&
startkey=["2011-05-26 23:30:00",1,22]&
endkey=["2011-05-27 01:00:00",2,23]

I will get results like these:
{
   "key": ["2011-05-26 23:30:10", 1, 22, "Lisa"],
   "value": 1
},
{
   "key": ["2011-05-26 23:30:12", 1, 22, "Lisa"],
   "value": 3
},
{
   "key": ["2011-05-26 23:33:43", 1, 22, "Lisa"],
   "value": 5
},
[..]

That gives me quite a big result set, since there are so many hits 
where the created_at is slightly different.

> Alternatively, if you want to count *just* the names and *just* the
> dates, you'll need two indexes, one for names and one for dates, as you
> can't "skip" the key groups (as your example tried to do with [{},...]).
>
> Basically, you'll need an additional view/index for each key you're
> wanting to count + whatever output you want to make the counting more
> granular (in this case, date).

Mhmm. So in this case, it means I need an index for one_id, another_id 
and a_name (three in total)? If yes, I'm puzzled as to how I can make 
use of these indexes with just one GET request?

[..]

Initially, I got something working for my use case, using two indexes: 
one to get the a_name values based on the search parameters one_id, 
another_id & created_at, and a second which, when queried, gave me the 
number of occurrences for a_name within the hits returned from the 
first query.
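
Roughly, the two map functions were along these lines (a simplified 
sketch; field names as in the example document above):

// index 1: look up the names matching the query parameters
function (doc) {
  emit([doc.one_id, doc.another_id, doc.created_at], doc.a_name);
}

// index 2: count occurrences per name (reduce: _count)
function (doc) {
  emit(doc.a_name, 1);
}

The second index was then queried with group=true, using the names 
returned by the first query as keys.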

However, this didn't feel optimal (although I've read posts on the 
mailing list of people doing two batches of queries before), so I tried 
to go down a different road, as described above.

Best regards,

-Torstein

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
Hi Robert,

On 09/06/11 17:10, Robert Newson wrote:
> If you're not keeping the connection alive (I assume a loop calling
> wget is unable to do so) then much of the variance will be down to
> creating new TCP connections. A further problem is that of ephemeral
> port exhaustion; you might find the spikes in latency are cyclical,
> which is when wget is blocked waiting for a port (the others still
> lingering in TIME_WAIT, for example).

Thanks for your concern, but I don't see this as a problem. I run these 
tests many times and remove the max/min responses that deviate from the 
main tendency, and I do this for both wget loops. The test results are 
easily repeatable, so I'm pretty confident in saying that the query 
times go up <this> much on my system when Couch is busy updating its 
views.

> I suggest using apachebench or nodeload
> (https://github.com/benschmaus/nodeload) with appropriate settings to
> reuse connections.

Thanks for that, I'll have a look at your pointers. I have used ab -k 
before; nodeload is a new kid on the block I haven't checked out yet.

Normally, though, I'm satisfied with using httperf for testing read 
performance (making sure that I don't run out of file handles, of 
course, as well as tuning the server side to recycle the TCP 
connections fast enough) and siege for testing write performance. I 
used wget here merely to get a quick indication of the difference in 
query times.

> "I expect stale=update_after to behave the same as stale=ok but also
> trigger view index update."
>
> Yes, it's true.

Cheers for the confirmation.

-Torstein




Re: Complex queries & results

Posted by Robert Newson <rn...@apache.org>.
If you're not keeping the connection alive (I assume a loop calling
wget is unable to do so) then much of the variance will be down to
creating new TCP connections. A further problem is that of ephemeral
port exhaustion; you might find the spikes in latency are cyclical,
which is when wget is blocked waiting for a port (the others still
lingering in TIME_WAIT, for example).

I suggest using apachebench or nodeload
(https://github.com/benschmaus/nodeload) with appropriate settings to
reuse connections.

"I expect stale=update_after to behave the same as stale=ok but also
trigger view index update."

Yes, it's true.

B.

On 9 June 2011 10:02, Torstein Krause Johansen
<to...@gmail.com> wrote:
> Heya,
>
> On 09/06/11 16:07, Marcello Nuccio wrote:
>
>> 2011/6/9 Torstein Krause Johansen<to...@gmail.com>:
>>>
>>> On 08/06/11 19:21, Sean Copenhaver wrote:
>>>>
>>>> For stale=update_after/ok, does that mean the next query could
>>>> block until the view is updated? Even if it was also a
>>>> stale=update_after/ok query?
>>>
>>> I've tested this with two eternal wget loops, one running with
>>> stale=ok and one with the default non-stale setting.
>>
>> I think a more interesting test is with two loops, one running with
>> stale=ok and one running with stale=update_after.
>
> Could be, but I'm still on 0.11.0-2.3 :-)
>
>> I expect stale=update_after to behave the same as stale=ok but also
>> trigger view index update.
>> Is this true?
>
> Don't know, but that's what I'd expect too.
>
> Cheers,
>
> -Torstein
>

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
Heya,

On 09/06/11 16:07, Marcello Nuccio wrote:

> 2011/6/9 Torstein Krause Johansen<to...@gmail.com>:
>>
>> On 08/06/11 19:21, Sean Copenhaver wrote:
>>> For stale=update_after/ok, does that mean the next query could block
>>> until the view is updated? Even if it was also a stale=update_after/ok
>>> query?
>>
>> I've tested this with two eternal wget loops, one running with stale=ok
>> and one with the default non-stale setting.
>
> I think a more interesting test is with two loops, one running with
> stale=ok and one running with stale=update_after.

Could be, but I'm still on 0.11.0-2.3 :-)

> I expect stale=update_after to behave the same as stale=ok but also
> trigger view index update.
> Is this true?

Don't know, but that's what I'd expect too.

Cheers,

-Torstein

Re: Complex queries & results

Posted by Marcello Nuccio <ma...@gmail.com>.
Hi Torstein,

2011/6/9 Torstein Krause Johansen <to...@gmail.com>:
>
> On 08/06/11 19:21, Sean Copenhaver wrote:
>> For stale=update_after/ok, does that mean the next query could block
>> until the view is updated? Even if it was also a stale=update_after/ok
>> query?
>
> I've tested this with two eternal wget loops, one running with stale=ok
> and one with the default non-stale setting.

I think a more interesting test is with two loops, one running with
stale=ok and one running with stale=update_after.
I expect stale=update_after to behave the same as stale=ok but also
trigger view index update.
Is this true?

thanks,
  Marcello

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
Hi Sean,

On 08/06/11 19:21, Sean Copenhaver wrote:

> For stale=update_after/ok, does that mean the next query could block
> until the view is updated? Even if it was also a stale=update_after/ok
> query?

I've tested this with two eternal wget loops, one running with stale=ok 
and one with the default non-stale setting.

During the index update (600 documents/s), the stale=ok loop continued 
to deliver results throughout my tests, while the other wget loop hung 
for more than five minutes. Also worth noting is that the stale=ok 
query times went up from ~0.05s to ~0.5s during my tests.

Cheers,

-Torstein


Re: Complex queries & results

Posted by Sean Copenhaver <se...@gmail.com>.
Torstein, great to hear that the reduce is working for you.

For stale=update_after/ok, does that mean the next query could block
until the view is updated? Even if it was also a stale=update_after/ok
query?

On Wed, Jun 8, 2011 at 5:43 AM, Torstein Krause Johansen <
torsteinkrausework@gmail.com> wrote:

> On 08/06/11 13:56, Marcello Nuccio wrote:
>
>> I have not tested it, but CouchDB 1.1 does have some improvements:
>>
>>
>> http://docs.couchbase.org/couchdb-release-1.1/index.html#couchdb-release-1.1-etag
>>
>> http://docs.couchbase.org/couchdb-release-1.1/index.html#coudhdb-release-1.1-updateafter
>>
>
> Thanks for the hints, stale=update_after seems to be reason enough for
> upgrading to 1.1 :-)
>
> Cheers,
>
> -Torstein
>



-- 
“The limits of language are the limits of one's world.” -Ludwig von
Wittgenstein

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
On 08/06/11 13:56, Marcello Nuccio wrote:
> I have not tested it, but CouchDB 1.1 does have some improvements:
>
> http://docs.couchbase.org/couchdb-release-1.1/index.html#couchdb-release-1.1-etag
> http://docs.couchbase.org/couchdb-release-1.1/index.html#coudhdb-release-1.1-updateafter

Thanks for the hints, stale=update_after seems to be reason enough for 
upgrading to 1.1 :-)

Cheers,

-Torstein

Re: Complex queries & results

Posted by Gabor Ratky <rg...@rgabostyle.com>.
The default blocking also only applies to views in a single design document; any view in other design documents can be queried and will respond immediately if it is up to date (or with stale=ok or stale=update_after, as mentioned below). This also means that a view update is triggered only for the views in the design document in question, not for others.
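
For example (the design document names here are made up), splitting 
the views like this:

_design/counts -> the frequently updated count view(s)
_design/other  -> everything else

means a long index build for _design/counts never blocks queries 
against the views in _design/other.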

Gabor

On Wednesday, June 8, 2011 at 10:53 AM, Robert Newson wrote:

> "couch doesn't answer any query while it's doing its view updates."
> 
> This is not true, though it is the default behavior to block queries
> until the view is up to date. Pass ?stale=ok to get results from the
> view immediately (but note that the view won't be updated). See the new
> ?stale=update_after for a non-blocking view query that triggers a view
> update asynchronously.
> 
> B.
> 
> On 8 June 2011 06:56, Marcello Nuccio <marcello.nuccio@gmail.com> wrote:
> > I have not tested it, but CouchDB 1.1 does have some improvements:
> > 
> > http://docs.couchbase.org/couchdb-release-1.1/index.html#couchdb-release-1.1-etag
> > http://docs.couchbase.org/couchdb-release-1.1/index.html#coudhdb-release-1.1-updateafter
> > 
> > - Marcello
> > 
> > 2011/6/8 Mark Hahn <mark@boutiquing.com>:
> > > > couch doesn't answer any query while it's doing its view updates.
> > > 
> > > I was shocked when I first experienced this myself. I posted here and found
> > > the bad news. This was the first and so far only serious wart I've found on
> > > couch.
> > > 
> > > I received replies suggesting that I put different views in separate design
> > > docs. Apparently the blocking only happens on the same design doc, at least
> > > when updating the view code. When one view in a design doc needs updating
> > > it updates them all. I haven't tried this yet.
> > > 
> > > I also haven't yet played with the "stale" option, which allows reading docs
> > > that are out-of-date. Between the two options I've got my fingers crossed
> > > that I can avoid any real blocking. Blocking a server is the worst sin
> > > of all.
> > > 
> > > 
> > > On Tue, Jun 7, 2011 at 9:49 PM, Torstein Krause Johansen <
> > > torsteinkrausework@gmail.com> wrote:
> > > 
> > > > On 07/06/11 14:21, Torstein Krause Johansen wrote:
> > > > 
> > > > On 06/06/11 22:09, Sean Copenhaver wrote:
> > > > 
> > > > https://gist.github.com/1010318
> > > > > > 
> > > > > > I tried this out with 10 docs fitting your example structure and with a
> > > > > > plain query (no grouping, no filtering, reduce on) I get back:
> > > > > > 
> > > > > > { John: 4, Jane: 6 }
> > > > > 
> > > > > Looks spot on! Thank you _so_ much for doing this.
> > > > > 
> > > > > I'm really curious how this performs, I will be-siege my couch with bulk
> > > > > updates, giving it a big-ish data set while simultaneously be-sieging it
> > > > > with read GETs querying this map/reduce you've created. Will be very
> > > > > interesting.
> > > > 
> > > > I started by using siege to post 1000s of documents with 14 fields &
> > > > values (the actual data my application will be using) and let it run
> > > > till I got a fair data set. After reducing the now ~710,000 document big DB
> > > > from 4.2GB to ~360MB, the queries went from ~8s to ~0.05s. Fantastic.
> > > > 
> > > > I then unleashed siege again (100 parallel threads this time, creating
> > > > 200 new documents each using the bulk endpoint (siege somehow didn't
> > > > want to work with my initial 1000 document big .json file, so I had to
> > > > reduce it to 200 to make siege not choke on it)) and wget (creating random
> > > > data, using the normal document endpoint), the queries immediately started
> > > > to climb upwards, 1s, 2s, 3s ... 80s and with no sign of stopping.
> > > > 
> > > > To see if it was the simultaneous write and read that were causing the
> > > > longer query times, I stopped siege and wget on my test machine
> > > > (different host, going through the same network switch).
> > > > 
> > > > Since there had been quite a number of new documents, couch started
> > > > its checkpoint view updating leaving my couch unable to respond to any
> > > > queries for around 90s.
> > > > 
> > > > The query times then dropped down, stabilising on 0.06 to 0.08s when
> > > > querying the DB with now ~800,000 documents and result sets containing ~50
> > > > keys with ~2000 counts each. Great!
> > > > 
> > > > The climbing query times when doing so many updates are not a real
> > > > concern for me as I'll put a queue in front of couch which buffers up
> > > > the incoming write requests and fires up a bulk update every 30
> > > > seconds or so. Couch seems more than fast enough write-wise as long as
> > > > the documents are provided in bulk.
> > > > 
> > > > What does worry me, though, is that couch doesn't answer any query
> > > > while it's doing its view updates. Even with a nice cache server in
> > > > front which can serve old content till couch is finished updating its
> > > > views, I still find it a bit unsettling. Do you have any tips for me here?
> > > > 
> > > > Cheers,
> > > > 
> > > > -Torstein
> > > 
> > > 
> > > 
> > > --
> > > Mark Hahn
> > > Website Manager
> > > mark@boutiquing.com
> > > 949-229-1012


Re: Complex queries & results

Posted by Robert Newson <ro...@gmail.com>.
"couch doesn't answer any query while it's doing its view updates."

This is not true, though it is the default behavior to block queries
until the view is up to date. Pass ?stale=ok to get results from the
view immediately (but note that the view won't be updated). See the new
?stale=update_after for a non-blocking view query that triggers a view
update asynchronously.
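
For example (design doc and view names made up):

GET /db/_design/app/_view/by_name?group=true&stale=ok
GET /db/_design/app/_view/by_name?group=true&stale=update_after

The first never triggers an index update; the second returns the stale 
result immediately and kicks the update off in the background.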

B.

On 8 June 2011 06:56, Marcello Nuccio <ma...@gmail.com> wrote:
> I have not tested it, but CouchDB 1.1 does have some improvements:
>
> http://docs.couchbase.org/couchdb-release-1.1/index.html#couchdb-release-1.1-etag
> http://docs.couchbase.org/couchdb-release-1.1/index.html#coudhdb-release-1.1-updateafter
>
> - Marcello
>
> 2011/6/8 Mark Hahn <ma...@boutiquing.com>:
>>> couch doesn't answer any query while it's doing its view updates.
>>
>> I was shocked when I first experienced this myself.  I posted here and found
>> the bad news.  This was the first and so far only serious wart I've found on
>> couch.
>>
>> I received replies suggesting that I put different views in separate design
>> docs.  Apparently the blocking only happens on the same design doc, at least
>> when updating the view code.  When one view in a design doc needs updating
>> it updates them all.  I haven't tried this yet.
>>
>> I also haven't yet played with the "stale" option, which allows reading docs
>> that are out-of-date.  Between the two options I've got my fingers crossed
>> that I can avoid any real blocking.  Blocking a server is the worst sin
>> of all.
>>
>>
>> On Tue, Jun 7, 2011 at 9:49 PM, Torstein Krause Johansen <
>> torsteinkrausework@gmail.com> wrote:
>>
>>> On 07/06/11 14:21, Torstein Krause Johansen wrote:
>>>
>>>  On 06/06/11 22:09, Sean Copenhaver wrote:
>>>>
>>>
>>>  https://gist.github.com/1010318
>>>>>
>>>>> I tried this out with 10 docs fitting your example structure and with a
>>>>> plain query (no grouping, no filtering, reduce on) I get back:
>>>>>
>>>>> { John: 4, Jane: 6 }
>>>>>
>>>>
>>>> Looks spot on! Thank you _so_ much for doing this.
>>>>
>>>> I'm really curious how this performs, I will be-siege my couch with bulk
>>>> updates, giving it a big-ish data set while simultaneously be-sieging it
>>>> with read GETs querying this map/reduce you've created. Will be very
>>>> interesting.
>>>>
>>>
>>> I started by using siege to post 1000s of documents with 14 fields &
>>> values (the actual data my application will be using) and let it run
>>> till I got a fair data set. After reducing the now ~710,000 document big DB
>>> from 4.2GB to ~360MB, the queries went from ~8s to ~0.05s. Fantastic.
>>>
>>> I then unleashed siege again (100 parallel threads this time, creating
>>> 200 new documents each using the bulk endpoint (siege somehow didn't
>>> want to work with my initial 1000 document big .json file, so I had to
>>> reduce it to 200 to make siege not choke on it)) and wget (creating random
>>> data, using the normal document endpoint). The queries immediately started
>>> to climb upwards, 1s, 2s, 3s ... 80s and with no sign of stopping.
>>>
>>> To see if it was the simultaneous write and read that were causing the
>>> longer query times, I stopped siege and wget on my test machine
>>> (different host, going through the same network switch).
>>>
>>> Since there had been quite a number of new documents, couch started
>>> its checkpoint view updating leaving my couch unable to respond to any
>>> queries for around 90s.
>>>
>>> The query times then dropped down, stabilising on 0.06 to 0.08s when
>>> querying the DB with now ~800,000 documents and result sets containing ~50
>>> keys with ~2000 counts each. Great!
>>>
>>> The climbing query times when doing so many updates are not a real
>>> concern for me as I'll put a queue in front of couch which buffers up
>>> the incoming write requests and fires up a bulk update every 30
>>> seconds or so. Couch seems more than fast enough write-wise as long as
>>> the documents are provided in bulk.
>>>
>>> What does worry me, though, is that couch doesn't answer any query
>>> while it's doing its view updates. Even with a nice cache server in
>>> front which can serve old content till couch is finished updating its
>>> views, I still find it a bit unsettling. Do you have any tips for me here?
>>>
>>> Cheers,
>>>
>>> -Torstein
>>>
>>
>>
>>
>> --
>> Mark Hahn
>> Website Manager
>> mark@boutiquing.com
>> 949-229-1012
>>
>

Re: Complex queries & results

Posted by Marcello Nuccio <ma...@gmail.com>.
I have not tested it, but CouchDB 1.1 does have some improvements:

http://docs.couchbase.org/couchdb-release-1.1/index.html#couchdb-release-1.1-etag
http://docs.couchbase.org/couchdb-release-1.1/index.html#coudhdb-release-1.1-updateafter

- Marcello

2011/6/8 Mark Hahn <ma...@boutiquing.com>:
>> couch doesn't answer any query while it's doing its view updates.
>
> I was shocked when I first experienced this myself.  I posted here and found
> the bad news.  This was the first and so far only serious wart I've found on
> couch.
>
> I received replies suggesting that I put different views in separate design
> docs.  Apparently the blocking only happens on the same design doc, at least
> when updating the view code.  When one view in a design doc needs updating
> it updates them all.  I haven't tried this yet.
>
> I also haven't yet played with the "stale" option, which allows reading docs
> that are out-of-date.  Between the two options I've got my fingers crossed
> that I can avoid any real blocking.  Blocking a server is the worst sin
> of all.
>
>
> On Tue, Jun 7, 2011 at 9:49 PM, Torstein Krause Johansen <
> torsteinkrausework@gmail.com> wrote:
>
>> On 07/06/11 14:21, Torstein Krause Johansen wrote:
>>
>>  On 06/06/11 22:09, Sean Copenhaver wrote:
>>>
>>
>>  https://gist.github.com/1010318
>>>>
>>>> I tried this out with 10 docs fitting your example structure and with a
>>>> plain query (no grouping, no filtering, reduce on) I get back:
>>>>
>>>> { John: 4, Jane: 6 }
>>>>
>>>
>>> Looks spot on! Thank you _so_ much for doing this.
>>>
>>> I'm really curious how this performs, I will be-siege my couch with bulk
>>> updates, giving it a big-ish data set while simultaneously be-sieging it
>>> with read GETs querying this map/reduce you've created. Will be very
>>> interesting.
>>>
>>
>> I started by using siege to post 1000s of documents with 14 fields &
>> values (the actual data my application will be using) and let it run
>> till I got a fair data set. After reducing the now ~710,000 document big DB
>> from 4.2GB to ~360MB, the queries went from ~8s to ~0.05s. Fantastic.
>>
>> I then unleashed siege again (100 parallel threads this time, creating
>> 200 new documents each using the bulk endpoint (siege somehow didn't
>> want to work with my initial 1000 document big .json file, so I had to
>> reduce it to 200 to make siege not choke on it)) and wget (creating random
>> data, using the normal document endpoint). The queries immediately started
>> to climb upwards, 1s, 2s, 3s ... 80s and with no sign of stopping.
>>
>> To see if it was the simultaneous write and read that were causing the
>> longer query times, I stopped siege and wget on my test machine
>> (different host, going through the same network switch).
>>
>> Since there had been quite a number of new documents, couch started
>> its checkpoint view updating leaving my couch unable to respond to any
>> queries for around 90s.
>>
>> The query times then dropped down, stabilising on 0.06 to 0.08s when
>> querying the DB with now ~800,000 documents and result sets containing ~50
>> keys with ~2000 counts each. Great!
>>
>> The climbing query times when doing so many updates are not a real
>> concern for me as I'll put a queue in front of couch which buffers up
>> the incoming write requests and fires up a bulk update every 30
>> seconds or so. Couch seems more than fast enough write-wise as long as
>> the documents are provided in bulk.
>>
>> What does worry me, though, is that couch doesn't answer any query
>> while it's doing its view updates. Even with a nice cache server in
>> front which can serve old content till couch is finished updating its
>> views, I still find it a bit unsettling. Do you have any tips for me here?
>>
>> Cheers,
>>
>> -Torstein
>>
>
>
>
> --
> Mark Hahn
> Website Manager
> mark@boutiquing.com
> 949-229-1012
>

Re: Complex queries & results

Posted by Mark Hahn <ma...@boutiquing.com>.
> couch doesn't answer any query while it's doing its view updates.

I was shocked when I first experienced this myself.  I posted here and found
the bad news.  This was the first and so far only serious wart I've found on
couch.

I received replies suggesting that I put different views in separate design
docs.  Apparently the blocking only happens on the same design doc, at least
when updating the view code.  When one view in a design doc needs updating
it updates them all.  I haven't tried this yet.

I also haven't yet played with the "stale" option, which allows reading docs
that are out-of-date.  Between the two options I've got my fingers crossed
that I can avoid any real blocking.  Blocking a server is the worst sin
of all.


On Tue, Jun 7, 2011 at 9:49 PM, Torstein Krause Johansen <
torsteinkrausework@gmail.com> wrote:

> On 07/06/11 14:21, Torstein Krause Johansen wrote:
>
>  On 06/06/11 22:09, Sean Copenhaver wrote:
>>
>
>  https://gist.github.com/1010318
>>>
>>> I tried this out with 10 docs fitting your example structure and with a
>>> plain query (no grouping, no filtering, reduce on) I get back:
>>>
>>> { John: 4, Jane: 6 }
>>>
>>
>> Looks spot on! Thank you _so_ much for doing this.
>>
>> I'm really curious how this performs, I will be-siege my couch with bulk
>> updates, giving it a big-ish data set while simultaneously be-sieging it
>> with read GETs querying this map/reduce you've created. Will be very
>> interesting.
>>
>
> I started by using siege to post 1000s of documents with 14 fields &
> values (the actual data my application will be using) and let it run
> till I got a fair data set. After reducing the now ~710,000 document big DB
> from 4.2GB to ~360MB, the queries went from ~8s to ~0.05s. Fantastic.
>
> I then unleashed siege again (100 parallel threads this time, creating
> 200 new documents each using the bulk endpoint (siege somehow didn't
> want to work with my initial 1000 document big .json file, so I had to
> reduce it to 200 to make siege not choke on it)) and wget (creating random
> data, using the normal document endpoint). The queries immediately started
> to climb upwards, 1s, 2s, 3s ... 80s and with no sign of stopping.
>
> To see if it was the simultaneous write and read that were causing the
> longer query times, I stopped siege and wget on my test machine
> (different host, going through the same network switch).
>
> Since there had been quite a number of new documents, couch started
> its checkpoint view updating leaving my couch unable to respond to any
> queries for around 90s.
>
> The query times then dropped down, stabilising on 0.06 to 0.08s when
> querying the DB with now ~800,000 documents and result sets containing ~50
> keys with ~2000 counts each. Great!
>
> The climbing query times when doing so many updates are not a real
> concern for me as I'll put a queue in front of couch which buffers up
> the incoming write requests and fires up a bulk update every 30
> seconds or so. Couch seems more than fast enough write-wise as long as
> the documents are provided in bulk.
>
> What does worry me, though, is that couch doesn't answer any query
> while it's doing its view updates. Even with a nice cache server in
> front which can serve old content till couch is finished updating its
> views, I still find it a bit unsettling. Do you have any tips for me here?
>
> Cheers,
>
> -Torstein
>



-- 
Mark Hahn
Website Manager
mark@boutiquing.com
949-229-1012

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
On 07/06/11 14:21, Torstein Krause Johansen wrote:

> On 06/06/11 22:09, Sean Copenhaver wrote:

>> https://gist.github.com/1010318
>>
>> I tried this out with 10 docs fitting your example structure and with a
>> plain query (no grouping, no filtering, reduce on) I get back:
>>
>> { John: 4, Jane: 6 }
>
> Looks spot on! Thank you _so_ much for doing this.
>
> I'm really curious how this performs, I will be-siege my couch with bulk
> updates, giving it a big-ish data set while simultaneously be-sieging it
> with read GETs querying this map/reduce you've created. Will be very
> interesting.

I started by using siege to post 1000s of documents with 14 fields &
values (the actual data my application will be using) and let it run
till I got a fair data set. After reducing the now ~710,000-document 
DB from 4.2GB to ~360MB, the queries went from ~8s to ~0.05s. Fantastic.

I then unleashed siege again (100 parallel threads this time, creating
200 new documents each using the bulk endpoint; siege somehow didn't
want to work with my initial 1000-document .json file, so I had to
reduce it to 200 to make siege not choke on it) and wget (creating 
random data, using the normal document endpoint). The queries 
immediately started to climb upwards, 1s, 2s, 3s ... 80s, with no 
sign of stopping.

To see if it was the simultaneous write and read that were causing the
longer query times, I stopped siege and wget on my test machine
(different host, going through the same network switch).

Since there had been quite a number of new documents, couch started
its checkpoint view updating, leaving my couch unable to respond to any
queries for around 90s.

The query times then dropped down, stabilising at 0.06 to 0.08s when
querying the DB with now ~800,000 documents and result sets containing 
~50 keys with ~2000 counts each. Great!

The climbing query times when doing so many updates are not a real
concern for me, as I'll put a queue in front of couch which buffers up
the incoming write requests and fires up a bulk update every 30
seconds or so. Couch seems more than fast enough write-wise as long as
the documents are provided in bulk.
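
As a rough sketch of that queue (Node.js; the database name is made 
up), using the _bulk_docs endpoint:

var http = require('http');
var buffer = [];

// callers hand documents to the queue instead of writing directly
function enqueue(doc) { buffer.push(doc); }

// every 30 seconds, flush everything queued so far in one POST
setInterval(function () {
  if (!buffer.length) return;
  var body = JSON.stringify({ docs: buffer.splice(0) });
  var req = http.request({
    host: 'localhost', port: 5984,
    path: '/mydb/_bulk_docs', method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  });
  req.end(body);
}, 30000);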

What does worry me, though, is that couch doesn't answer any query
while it's doing its view updates. Even with a nice cache server in
front which can serve old content till couch is finished updating its 
views, I still find it a bit unsettling. Do you have any tips for me here?

Cheers,

-Torstein

Re: Complex queries & results

Posted by Sean Copenhaver <se...@gmail.com>.
Yeah, I'm afraid I cannot attest to when a reduce function is unsafe.

I would imagine you have to be careful, because the reduce values for 
each page of the b-tree get stored in the view, so you wouldn't want a 
reduce value that gets out of hand. A reduce function that has side 
effects could lead to hard-to-understand results, since the whole range 
of values isn't computed on each query. Also, a reduce function that 
takes a long time to compute could slow down updating an index. There 
may be other considerations.

I wouldn't think that the reduce function I gave you in the example is 
unsafe, mainly because it's a very simple aggregation dictionary, names 
to integers. But it is an unbounded result, which is never good: the 
size of the result is the number of unique 'a_name' values you have. It 
might cause problems if you have millions of unique 'a_name' values, 
but probably not when you have tens of thousands. So the order of 
magnitude here is important, I think, but only testing will uncover 
that.

On Tue, Jun 7, 2011 at 2:50 AM, Marcello Nuccio
<ma...@gmail.com>wrote:

> 2011/6/7 Torstein Krause Johansen <to...@gmail.com>:
> > I'm still puzzled, though. When reading up on reduces, I got put off
> doing
> > anything fancy in the reduce function as the guide on
> > http://guide.couchdb.org/draft/views.html#example/3 states:
> >
> > "A common mistake new CouchDB users make is attempting to construct
> complex
> > aggregate values with a reduce function. Full reductions should result in
> a
> > scalar value, like 5, and not, for instance, a JSON hash with a set of
> > unique keys and the count of each."
> >
> > And from my understanding, this is exactly what I want to do here, but
> > perhaps I'm misunderstanding the author's meaning?
>
> In my experience it is not a problem for a reduce function to return a
> hash.
>
> The strict requirement for reduce functions is to actually reduce the
> data they get passed to a small, constant-size value, be it a scalar or
> an object.
>
> You can look at this example for an idea of what I mean
>
> http://stackoverflow.com/questions/5637412/getting-a-list-of-documents-with-a-maximum-field-value-from-couchdb-view/5654154#5654154
>
> I hope to be corrected if I am wrong.
>
> Marcello
>



-- 
“The limits of language are the limits of one's world.” -Ludwig von
Wittgenstein

Re: Complex queries & results

Posted by Marcello Nuccio <ma...@gmail.com>.
2011/6/7 Torstein Krause Johansen <to...@gmail.com>:
> I'm still puzzled, though. When reading up on reduces, I got put off doing
> anything fancy in the reduce function as the guide on
> http://guide.couchdb.org/draft/views.html#example/3 states:
>
> "A common mistake new CouchDB users make is attempting to construct complex
> aggregate values with a reduce function. Full reductions should result in a
> scalar value, like 5, and not, for instance, a JSON hash with a set of
> unique keys and the count of each."
>
> And from my understanding, this is exactly what I want to do here, but
> perhaps I'm misunderstanding the author's meaning?

In my experience it is not a problem for a reduce function to return a hash.

The strict requirement for reduce functions is to actually reduce the
data they get passed to a small, constant-size value, be it a scalar or
an object.

You can look at this example for an idea of what I mean
http://stackoverflow.com/questions/5637412/getting-a-list-of-documents-with-a-maximum-field-value-from-couchdb-view/5654154#5654154

I hope to be corrected if I am wrong.

Marcello

Re: Complex queries & results

Posted by Torstein Krause Johansen <to...@gmail.com>.
Hi Sean,

and thank you so much for your reply.

On 06/06/11 22:09, Sean Copenhaver wrote:

> Anyway, back to what you are trying to accomplish. Honestly it sounds like
> you are trying to get too advanced for the built-in _count or _sum reduce
> functions. Have you tried writing a custom reduce function that does the
> grouping how you want, basically by name alone?
>
> https://gist.github.com/1010318
>
> I tried this out with 10 docs fitting your example structure and with a
> plain query (no grouping, no filtering, reduce on) I get back:
>
> { John: 4, Jane: 6 }

Looks spot on! Thank you _so_ much for doing this.

I'm really curious how this performs. I will be-siege my couch with 
bulk updates, giving it a big-ish data set, while simultaneously 
be-sieging it with read GETs querying this map/reduce you've created. 
It will be very interesting.

I'm still puzzled, though. When reading up on reduces, I got put off 
doing anything fancy in the reduce function as the guide on 
http://guide.couchdb.org/draft/views.html#example/3 states:

"A common mistake new CouchDB users make is attempting to construct 
complex aggregate values with a reduce function. Full reductions should 
result in a scalar value, like 5, and not, for instance, a JSON hash 
with a set of unique keys and the count of each."

And from my understanding, this is exactly what I want to do here, but 
perhaps I'm misunderstanding the author's meaning?

Cheers,

-Torstein

Re: Complex queries & results

Posted by Sean Copenhaver <se...@gmail.com>.
I would like to add that a CouchDB view represents a single-dimensional
index, index in the same sense as in a relational database. A list of
keys is like specifying sub-ordering: order by the first, then the
second, then the....

It sounded like at some point you may have been a bit confused as to what
a map function defines. It defines the index's key and the value mapped
to that key. This gives an otherwise completely unstructured assortment
of data a view of consistent structure.
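
For example (field names taken from your documents), a key like:

emit([doc.one_id, doc.another_id, doc.created_at], null);

sorts the index by one_id first, then by another_id within each 
one_id, then by created_at within each of those.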

Anyway, back to what you are trying to accomplish. Honestly it sounds like
you are trying to get too advanced for the built-in _count or _sum reduce
functions. Have you tried writing a custom reduce function that does the
grouping how you want, basically by name alone?

https://gist.github.com/1010318

I tried this out with 10 docs fitting your example structure and with a
plain query (no grouping, no filtering, reduce on) I get back:

{ John: 4, Jane: 6 }

Maybe this example can get you started. In the map function I defined
above, I used those keys because it looks like you are interested in
filtering on them. In the query I used for the example results, I
actually didn't use the key at all (no filtering).
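
In outline it does something along these lines (a sketch in the same 
spirit; see the gist itself for the exact code):

// map: key on the fields you may want to filter by; the value
// is the name to be counted
function (doc) {
  emit([doc.one_id, doc.another_id, doc.created_at], doc.a_name);
}

// reduce: build a { name: count } dictionary, merging the
// partial dictionaries on rereduce
function (keys, values, rereduce) {
  var counts = {}, i, name;
  if (rereduce) {
    for (i = 0; i < values.length; i++) {
      for (name in values[i]) {
        counts[name] = (counts[name] || 0) + values[i][name];
      }
    }
  } else {
    for (i = 0; i < values.length; i++) {
      counts[values[i]] = (counts[values[i]] || 0) + 1;
    }
  }
  return counts;
}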

On Mon, Jun 6, 2011 at 8:12 AM, Torstein Krause Johansen <
torsteinkrausework@gmail.com> wrote:

> Hi Benjamin,
>
> and thanks for your comments.
>
>
> On 31/05/11 22:11, Benjamin Young wrote:
>
>> On 5/27/11 5:16 AM, Torstein Krause Johansen wrote:
>>
>
>  ?group=true&group_level=2&startkey=["2011-05-26"]&endkey=["2011-05-27",
>>>> {}]
>>>>
>>>> results in:
>>>>
>>>> {
>>>> "key": ["2011-05-26", "Lisa"],
>>>> "value": 1
>>>> },
>>>> {
>>>> "key": ["2011-05-26", "John"],
>>>> "value": 2
>>>> },
>>>> {
>>>> "key": ["2011-05-27", "John"],
>>>> "value": 1
>>>> }
>>>>
>>>> You can of course emit not just days, but also weeks, months,
>>>> quarters if that's what you always want. If it's arbitrary and you need
>>>> to aggregate the names afterwards from this smaller set, you should do
>>>> it in the client (whoever calls CouchDB to get this information out).
>>>>
>>>
>>> Mhmm, ok, thanks for explaining this.
>>>
>>> It means though, that for every unique time stamp that a_name has an
>>> entry, there will be a corresponding count returned (like the three
>>> you listed above).
>>>
>>> Hence, if a_name has 1000 entries at slightly different times within
>>> the time range I'm searching for (my created_at includes seconds), I
>>> will get 1000 such entries back.
>>>
>>
>> It really just depends on what you want to count/reduce/etc. If you only
>> want a count of the names (and don't want additional
>> granularity--name+year counts) then just return the name as the index.
>> If you want the count of names by year/month/day, etc, then return those
>> *after* the name, so you can add specificity by incrementing your
>> group_level param.
>>
>
> There's probably something I haven't understood here. If I add my search
> fields after a_name, then how can I limit my search with startkey and
> endkey when a_name cannot be included in them (since the name is what I
> want to count on)?
>
> Just to be sure, I want to re-state what I want: I have documents with the
> following fields:
>
> {
>    one_id : 1,
>    another_id : 22,
>    created_at : "2011-05-26 23:22:11",
>    a_name : "Lisa"
> }
>
> I want to be able to search all occurrences with a combination of the
> first three fields as query parameters and then count the number of
> a_name occurrences within each of these search collections.
>
> There will be many entries like the one above (say 30,000), where the only
> difference is the created_at field. Searching with these variable parameters:
>
>    one_id=1,
>    another_id=22,
>    created_at > "2011-05-26 23:30:00"
>    created_at < "2011-05-27 01:00:00"
>
> I want to end up with a dictionary listing the names and their count
> matching the search parameters:
>
> {
>   "Lisa" : 132,
>   "John" : 16
> }
>
> If I put [created_at, one_id, another_id, a_name] in the key, I can use
> the start and end keys:
> ?group=true&
> group_level=4&
> startkey=["2011-05-26 23:30:00",1,22]&
> endkey=["2011-05-27 01:00:00",2,23]
>
> I will get results like these:
> {
>  "key": ["2011-05-26 23:30:10", 1, 22, "Lisa"],
>  "value": 1
> },
> {
>  "key": ["2011-05-26 23:30:12", 1, 22, "Lisa"],
>  "value": 3
> },
> {
>  "key": ["2011-05-26 23:33:43", 1, 22, "Lisa"],
>  "value": 5
> },
> [..]
>
> That gives me quite a big result set, since there are so many hits where
> the created_at is slightly different.
>
>
>  Alternatively, if you want to count *just* the names and *just* the
>> dates, you'll need two indexes, one for names and one for dates, as you
>> can't "skip" the key groups (as your example tried to do with [{},...]).
>>
>> Basically, you'll need an additional view/index for each key you're
>> wanting to count + whatever output you want to make the counting more
>> granular (in this case, date).
>>
>
> Mhmm. So in this case, it means I need an index for one_id, another_id and
> a_name (three in total)? If yes, I'm puzzled as to how I can make use of
> these indexes with just one GET request?
>
> [..]
>
> Initially, I got something working for my use case, using two indexes: one
> to get the a_name values based on the search parameters one_id,
> another_id & created_at, and a second which, when queried, gave me the
> number of occurrences for a_name within the hits returned from the first
> query.
>
> However, this didn't feel optimal (although I've read posts on the mailing
> list of people doing two batches of queries before), so I tried to go down a
> different road, as described above.
>
> Best regards,
>
> -Torstein
>



-- 
“The limits of language are the limits of one's world.” -Ludwig von
Wittgenstein