Posted to user@couchdb.apache.org by Sean Clark Hess <se...@gmail.com> on 2010/03/01 17:30:24 UTC

View Troubles

We have a database full of tv listing information. I want to write a view
that will let me say "What is playing on ESPN at 14:45?" One way I can
accomplish this is to emit the show multiple times, on the half hour, and
then only ask what is playing at each half-hour interval. This is not ideal,
because it makes my view bigger (and this table is already huge).

Can anyone think of a better idea? I'm willing to change how the objects are
stored if that would help.

*Event Database Objects*
{ "network": "ESPN", "date":"2010-02-28", "time":"14:30", "duration" :
"01:25" }

Re: View Troubles

Posted by Brian Candler <B....@pobox.com>.
On Tue, Mar 02, 2010 at 12:20:28PM -0700, Sean Clark Hess wrote:
> > Multi-range queries have been mooted as a future feature.
> 
> 
> Mooted? As in "decided to be moot"? I've never heard it used as a verb
> before :)

http://www.chambersharrap.co.uk/chambers/features/chref/chref.py/main?title=21st&query=moot

> I'm ok working around it, but I don't really see why not to add it
> if you're allowed to ask for multiple keys.

It's just something that nobody has gotten around to implementing yet. The
suggestion was something like {"queries":[{"startkey":"foo1","endkey":"foo2"},...]}

> > Do remember that views themselves can be compacted.
> 
> Woah, what? What does that even mean? I thought compaction just removed old
> versions of documents. Does Couch keep around old versions of views?

As far as I know, the view Btrees are also append-only, which means that
compaction will remove wasted space.  Also, the keys may not have been
emitted in optimal order, and compaction should fix that too.

The view compaction API is documented on the wiki.
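
For reference, a minimal sketch of what triggering view compaction can look
like. CouchDB exposes it as a POST to /{db}/_compact/{design-doc-name}; the
database name "tvlistings" and design document name "listings" are made up for
illustration, and viewCompactionRequest is a hypothetical helper. It only
builds a description of the request, which can then be handed to whatever HTTP
client you use.

```javascript
// Build the HTTP request that asks CouchDB to compact the view indexes of one
// design document. "tvlistings" and "listings" below are hypothetical names.
function viewCompactionRequest(db, designDocName) {
  return {
    method: "POST",
    path: "/" + encodeURIComponent(db) + "/_compact/" + encodeURIComponent(designDocName),
    // Recent CouchDB versions expect a JSON content type on this request.
    headers: { "Content-Type": "application/json" }
  };
}

var request = viewCompactionRequest("tvlistings", "listings");
console.log(request.method + " " + request.path);
```

Compaction runs per design document, so each design document with large views
needs its own request.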

Regards,

Brian.

Re: View Troubles

Posted by Zachary Zolton <za...@gmail.com>.
Views use the same append-only B-tree storage as the database files.
So when you update a doc, the old revision will be there until
compaction.

Re: View Troubles

Posted by Sean Clark Hess <se...@gmail.com>.
>
> No, although you can have a different view emitting [user, time, station]
> instead of [user, station, time]

Good point. I'm constantly worried about size, since that database is 12 GB
without any views.

> Multi-range queries have been mooted as a future feature.

Mooted? As in "decided to be moot"? I've never heard it used as a verb
before :) I'm ok working around it, but I don't really see why not to add it
if you're allowed to ask for multiple keys.

> Do remember that views themselves can be compacted.

Woah, what? What does that even mean? I thought compaction just removed old
versions of documents. Does Couch keep around old versions of views?

Thanks for the tip about the caching.

On Mon, Mar 1, 2010 at 11:36 AM, Brian Candler <B....@pobox.com> wrote:

> On Mon, Mar 01, 2010 at 10:16:11AM -0700, Sean Clark Hess wrote:
> > I think I'll just emit every half hour. I can't do ranges easily, because
> > another query needs to ask for "everything playing on all of a user's
> > stations at 10:00". I have to use the "keys" post for that, which doesn't
> > take ranges.
>
> No, although you can have a different view emitting [user, time, station]
> instead of [user, station, time]
>
> Multi-range queries have been mooted as a future feature.
>
> > Right now we're emitting the data we need for that (above) query as the
> > value for that view. I was mostly worried about size because we're emitting
> > so much data. Would it be much slower to emit null and use include_docs?
> > That would save a significant amount of disk space, right?
>
> The view index will be smaller by the size of each document. When fetching
> docs using include_docs=true it'll need to traverse the Btree to find them;
> but equally it will need to traverse less disk space for the view itself,
> and hence may find more of it in cache.  You'll need to evaluate the
> tradeoff yourself, but if you're only fetching some dozens of rows at a time
> I'd have thought the include_docs overhead would not be noticed.
>
> Do remember that views themselves can be compacted.
>
> Cheers,
>
> Brian.
>

Re: View Troubles

Posted by Brian Candler <B....@pobox.com>.
On Mon, Mar 01, 2010 at 10:16:11AM -0700, Sean Clark Hess wrote:
> I think I'll just emit every half hour. I can't do ranges easily, because
> another query needs to ask for "everything playing on all of a user's
> stations at 10:00". I have to use the "keys" post for that, which doesn't
> take ranges.

No, although you can have a different view emitting [user, time, station]
instead of [user, station, time]

Multi-range queries have been mooted as a future feature.

> Right now we're emitting the data we need for that (above) query as the
> value for that view. I was mostly worried about size because we're emitting
> so much data. Would it be much slower to emit null and use include_docs?
> That would save a significant amount of disk space, right?

The view index will be smaller by the size of each document. When fetching
docs using include_docs=true it'll need to traverse the Btree to find them;
but equally it will need to traverse less disk space for the view itself,
and hence may find more of it in cache.  You'll need to evaluate the
tradeoff yourself, but if you're only fetching some dozens of rows at a time
I'd have thought the include_docs overhead would not be noticed.
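
A minimal sketch of the emit-null variant, assuming a view keyed by
[network, date, time]. The key layout is an assumption for illustration, and
the emit function below is only a stand-in so the sketch runs outside CouchDB
(inside CouchDB, emit is provided by the view server). The row stores no
value; the document is pulled in at query time with include_docs=true.

```javascript
// Stand-in for CouchDB's emit() so the sketch runs outside the view server.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map function: index the show by [network, date, time] and store no value,
// so the view index does not carry a copy of the document data.
function map(doc) {
  if (doc.network && doc.date && doc.time) {
    emit([doc.network, doc.date, doc.time], null);
  }
}

map({ network: "ESPN", date: "2010-02-28", time: "14:30", duration: "01:25" });

// Query parameters: include_docs=true makes CouchDB attach each row's
// document to the result. Keys must be JSON-encoded, then URL-encoded.
var query = "include_docs=true&key=" + encodeURIComponent(JSON.stringify(rows[0].key));
console.log(query);
```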

Do remember that views themselves can be compacted.

Cheers,

Brian.

Re: View Troubles

Posted by Sean Clark Hess <se...@gmail.com>.
I think I'll just emit every half hour. I can't do ranges easily, because
another query needs to ask for "everything playing on all of a user's
stations at 10:00". I have to use the "keys" post for that, which doesn't
take ranges.

Right now we're emitting the data we need for that (above) query as the
value for that view. I was mostly worried about size because we're emitting
so much data. Would it be much slower to emit null and use include_docs?
That would save a significant amount of disk space, right?

On Mon, Mar 1, 2010 at 9:53 AM, Brian Candler <B....@pobox.com> wrote:

> On Mon, Mar 01, 2010 at 09:30:24AM -0700, Sean Clark Hess wrote:
> > We have a database full of tv listing information. I want to write a view
> > that will let me say "What is playing on ESPN at 14:45?" One way I can
> > accomplish this is to emit the show multiple times, on the half hour, and
> > then only ask what is playing at each half-hour interval. This is not ideal,
> > because it makes my view bigger (and this table is already huge).
> >
> > Can anyone think of a better idea? I'm willing to change how the objects are
> > stored if that would help.
> >
> > *Event Database Objects*
> > { "network": "ESPN", "date":"2010-02-28", "time":"14:30", "duration" :
> > "01:25" }
>
> If you know that no show is longer than 3 hours, then just query for all
> shows on ESPN which start between 11:45 and 14:45.
>
> If your view design emits [channel, date, time] then this is just a
> startkey/endkey query - and if you do this with descending=true then you'll
> probably get the program you want first, and can ignore the remainder.  If
> you get nothing, then repeat the query with a longer period.
>
> Sometimes a multi-key fetch is helpful. e.g. if programmes always start on
> 5-minute boundaries you might get away with
>
> {"keys":[["ESPN","2010-02-28","14:45"],
>         ["ESPN","2010-02-28","14:40"],
>         ["ESPN","2010-02-28","14:35"],
>         ..]}
>
> But in this case, emitting the keys to allow a single startkey/endkey would
> be more efficient.
>
> You might be able to shrink the view by parsing the date and time and
> emitting a number:
>
> js> Date.parse("2010/02/28 14:45")
> 1267368300000
>
> This is the number of milliseconds since Jan 1 1970, and should take 8 bytes
> (double-precision float). Beware time zone issues of course.
>
> HTH,
>
> Brian.
>

Re: View Troubles

Posted by Brian Candler <B....@pobox.com>.
On Mon, Mar 01, 2010 at 09:30:24AM -0700, Sean Clark Hess wrote:
> We have a database full of tv listing information. I want to write a view
> that will let me say "What is playing on ESPN at 14:45?" One way I can
> accomplish this is to emit the show multiple times, on the half hour, and
> then only ask what is playing at each half-hour interval. This is not ideal,
> because it makes my view bigger (and this table is already huge).
> 
> Can anyone think of a better idea? I'm willing to change how the objects are
> stored if that would help.
> 
> *Event Database Objects*
> { "network": "ESPN", "date":"2010-02-28", "time":"14:30", "duration" :
> "01:25" }

If you know that no show is longer than 3 hours, then just query for all
shows on ESPN which start between 11:45 and 14:45.

If your view design emits [channel, date, time] then this is just a
startkey/endkey query - and if you do this with descending=true then you'll
probably get the program you want first, and can ignore the remainder.  If
you get nothing, then repeat the query with a longer period.
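
A rough sketch of how the query parameters for this could be built, assuming
the view key is [network, date, time] with time as "HH:MM" and the show's
duration stored as the row value. Those choices, and the helper names
buildQuery, coversTime and toQueryString, are assumptions for illustration.
Shows that started before midnight on the previous day are not handled here.

```javascript
// Zero-pad a number to two digits.
function pad2(n) { return (n < 10 ? "0" : "") + n; }

// "HH:MM" -> minutes since midnight.
function toMinutes(hhmm) {
  var parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// Build view parameters for "what is on `network` at `time` on `date`?",
// assuming no show runs longer than maxMinutes.
function buildQuery(network, date, time, maxMinutes) {
  // Clamp at midnight: shows that began the previous day are NOT covered.
  var earliest = Math.max(0, toMinutes(time) - maxMinutes);
  var earliestTime = pad2(Math.floor(earliest / 60)) + ":" + pad2(earliest % 60);
  return {
    descending: true,
    // With descending=true the startkey is the HIGH end of the range.
    startkey: JSON.stringify([network, date, time]),
    endkey: JSON.stringify([network, date, earliestTime])
  };
}

// The first row returned is only the answer if the show is still running.
function coversTime(startTime, duration, time) {
  return toMinutes(startTime) + toMinutes(duration) > toMinutes(time);
}

// Turn the parameters into a URL query string.
function toQueryString(params) {
  return Object.keys(params).map(function (name) {
    return name + "=" + encodeURIComponent(params[name]);
  }).join("&");
}

var q = buildQuery("ESPN", "2010-02-28", "14:45", 180);
console.log(toQueryString(q));
```

The coversTime check matters because the most recently started show may have
already finished by the time being asked about.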

Sometimes a multi-key fetch is helpful. e.g. if programmes always start on
5-minute boundaries you might get away with

{"keys":[["ESPN","2010-02-28","14:45"],
         ["ESPN","2010-02-28","14:40"],
         ["ESPN","2010-02-28","14:35"],
         ..]}

But in this case, emitting the keys to allow a single startkey/endkey would
be more efficient.

You might be able to shrink the view by parsing the date and time and
emitting a number:

js> Date.parse("2010/02/28 14:45")
1267368300000

This is the number of milliseconds since Jan 1 1970, and should take 8 bytes
(double-precision float). Beware time zone issues of course.
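
One way to sidestep the time zone issue is to build the number with Date.UTC
rather than Date.parse, so the result does not depend on the local zone of the
machine running the view. This sketch assumes the listing times are meant as
UTC, which may not match your data; toEpochMillis is a made-up helper name.

```javascript
// Convert the document's "YYYY-MM-DD" date and "HH:MM" time into milliseconds
// since Jan 1 1970, treating the listing time as UTC (an assumption).
function toEpochMillis(date, time) {
  var d = date.split("-");
  var t = time.split(":");
  // Date.UTC takes a zero-based month, hence the "- 1".
  return Date.UTC(Number(d[0]), Number(d[1]) - 1, Number(d[2]),
                  Number(t[0]), Number(t[1]));
}

console.log(toEpochMillis("2010-02-28", "14:45"));
```

If the listings are in a local zone instead, the offset has to be applied
explicitly before emitting the key.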

HTH,

Brian.

Re: View Troubles

Posted by J Chris Anderson <jc...@gmail.com>.
On Mar 1, 2010, at 8:30 AM, Sean Clark Hess wrote:

> We have a database full of tv listing information. I want to write a view
> that will let me say "What is playing on ESPN at 14:45?" One way I can
> accomplish this is to emit the show multiple times, on the half hour, and
> then only ask what is playing at each half-hour interval. This is not ideal,
> because it makes my view bigger (and this table is already huge).
> 

This is the right way to do it. Hopefully disk space trends to free fast enough to support your app. :) Probably each show will only emit a handful of times anyway, so the cost isn't actually so high.
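
A minimal sketch of what such a map function might look like, using the
document fields from the original post. The key layout [network, date, slot]
and the helper names are assumptions, the emit function is a stand-in so the
sketch runs outside CouchDB, and shows running past midnight are not handled.

```javascript
// Stand-in for CouchDB's emit() so the sketch runs outside the view server.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// "HH:MM" -> minutes since midnight.
function toMinutes(hhmm) {
  var parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// Minutes since midnight -> "HH:MM".
function toHHMM(minutes) {
  var h = Math.floor(minutes / 60);
  var m = minutes % 60;
  return (h < 10 ? "0" : "") + h + ":" + (m < 10 ? "0" : "") + m;
}

// Map function: one row for every half-hour mark at which the show is on air.
// Assumes shows do not run past midnight.
function map(doc) {
  if (!doc.network || !doc.date || !doc.time || !doc.duration) return;
  var start = toMinutes(doc.time);
  var end = start + toMinutes(doc.duration);
  // Begin at the first half-hour mark at or after the start time.
  for (var slot = Math.ceil(start / 30) * 30; slot < end; slot += 30) {
    emit([doc.network, doc.date, toHHMM(slot)], null);
  }
}

map({ network: "ESPN", date: "2010-02-28", time: "14:30", duration: "01:25" });
console.log(rows.map(function (r) { return r.key[2]; }).join(", "));
```

The example document produces rows for 14:30, 15:00 and 15:30. A question
about 14:45 would be rounded to a half-hour mark first, for example a lookup
with key=["ESPN","2010-02-28","14:30"], which answers for 14:30 rather than
exactly 14:45.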

The alternative is to emit the start and end times, and then make up for it with a lot of complexity on the query side. This alternative approach will only work if you can lay down a validation saying what the longest allowable show is, and then base your complex query logic around it. I'd recommend the first approach instead.

Chris

> Can anyone think of a better idea? I'm willing to change how the objects are
> stored if that would help.
> 
> *Event Database Objects*
> { "network": "ESPN", "date":"2010-02-28", "time":"14:30", "duration" :
> "01:25" }