Posted to user@couchdb.apache.org by muji <mi...@freeformsystems.com> on 2011/06/03 15:43:55 UTC

Tracking file throughput?

Hi,

I'm still new to couchdb and nosql so apologies if the answer to this
is trivial.

I'm trying to track the throughput of a file sent via a POST request
in a couchdb document.

My initial implementation creates a document for the file before the
POST is sent and then I have an update handler that increments the
"uploadbytes" for every chunk of data received from the client.

This *nearly* works except that I get document update conflicts (which
I think is to do with me not being able to throttle back the upload
while the db is updated) but the main problem is that for large files
(~2.4GB) the number of document revisions is around 40-50,000. So I
have a single document taking up between 0.7GB and 1GB. After
compaction it reduces to ~380KB, which of course is much better, but
this still seems excessive and poses problems with compacting a
write-heavy database. I understand the trick to that is to replicate,
compact and replicate back to the source, please correct me if I'm
wrong...

So, I don't think this approach is viable which makes me wonder
whether setting the _revs_limit will help, although I understand that
setting this per database still requires compaction and will save on
space after compaction.

I was thinking that tracking the throughput as chunks in individual
documents and then calculating the throughput with a map/reduce on all
the chunks might be a better approach. Although I'm concerned that
having lots of little documents for each data chunk will also take up
large amounts of space...
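
If the per-chunk route were taken, the view itself could be tiny: a map
that emits each chunk's size keyed by file, plus the built-in _sum reduce
queried with ?group=true to get a total per file. A sketch, with assumed
field names ("type", "file_id", "bytes"):

// map function in a design document; the reduce is just the built-in "_sum"
function (doc) {
  if (doc.type === 'chunk' && doc.file_id) {
    emit(doc.file_id, doc.bytes);
  }
}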

Any advice and guidance on the best way to tackle this would be much
appreciated.

-- 
muji.

Re: Tracking file throughput?

Posted by Owen Marshall <om...@facilityone.com>.
At Fri, 3 Jun 2011 16:49:25 +0200,
Jan Lehnardt wrote:
> 
> On 3 Jun 2011, at 16:47, Owen Marshall wrote:
> 
> > The odds of this happening are much lower because they tend to be very fast, but with a ton of rapid writes, anything is possible!
> 
> http://html5zombo.com/

They ported this to HTML5?!

Pack it in, everybody -- the Internet is complete ;-)

-- 
Owen Marshall
FacilityONE
http://www.facilityone.com | (502) 805-2126


Re: Tracking file throughput?

Posted by Jan Lehnardt <ja...@apache.org>.
On 3 Jun 2011, at 16:47, Owen Marshall wrote:

> The odds of this happening are much lower because they tend to be very fast, but with a ton of rapid writes, anything is possible!

http://html5zombo.com/

Cheers
Jan
-- 



Re: Tracking file throughput?

Posted by Owen Marshall <om...@facilityone.com>.
At Fri, 3 Jun 2011 15:28:54 +0100,
muji wrote:

> A quick search for continuous compaction didn't yield anything, and I
> don't see anything here:
> 
> http://wiki.apache.org/couchdb/Compaction
> 
> Could you point me in the right direction please?

I *think* what Jan means is to fire off a compaction call to the database either with each update, or every so many updates. I looked at this as an option under similar circumstances but didn't end up doing it because the database was under heavy writes and rapid compaction made me feel just too... nervous ;-)

You should experiment with the effects of this. It may be absolutely fine for you.

> Funny you mention about caching before updating couch, that was my
> very first implementation! I was updating Redis with the throughput
> and then updating the file document once the upload completed. That
> worked very well but I wanted to remove Redis from the stack as the
> application is already pretty complex.
> 
> I'm guessing my best option is to revert back to that technique?

Maybe just prepare the data directly in your application layer and send the document out only once, when everything completes.

> As an aside, why would my document update handler be raising
> conflicts? My understanding was that update handlers would not raise
> conflicts - is that correct?

IIRC, document update handlers *can* run into conflicts. The odds of this happening are much lower because they tend to be very fast, but with a ton of rapid writes, anything is possible!

-- 
Owen Marshall
FacilityONE
http://www.facilityone.com | (502) 805-2126


Re: Tracking file throughput?

Posted by Jan Lehnardt <ja...@apache.org>.
On 3 Jun 2011, at 17:00, muji wrote:

> Thanks again for your help Jan.
> 
> Sorry, I thought that continuous compaction might be a feature I had
> overlooked. I have no problems automating a compaction process, I
> always envisaged needing to do that...
> 
> I think that I will revert to running far fewer updates on the couchdb
> document and caching the throughput in Redis as disc space is more of
> a priority than application complexity.
> 
> A few more (different) questions in the pipeline as I'm still learning couch ;)

Sure, any time :)

Cheers
Jan
-- 

> 
> On Fri, Jun 3, 2011 at 3:37 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> On 3 Jun 2011, at 16:28, muji wrote:
>> 
>>> Thanks very much for the help.
>>> 
>>> I could of course reduce the amount of times the update is done but
>>> the service plans to bill based on throughput so this is quite
>>> critical from a billing perspective.
>> 
>> You can still bill on throughput as you will know exactly how much
>> data has been transferred in what amount of time, but reporting is
>> going to be less granular, i.e. chunks of say 10MB and not 100KB or
>> however big chunks are.
>> 
>>> A quick search for continuous compaction didn't yield anything, and I
>>> don't see anything here:
>>> 
>>> http://wiki.apache.org/couchdb/Compaction
>>> 
>>> Could you point me in the right direction please?
>> 
>> I made it up and I explained how to do it. Pseudocode:
>> 
>> while(`curl -X POST -H 'Content-Type: application/json' http://127.0.0.1:5984/db/_compact`);
>> 
>>> Funny you mention about caching before updating couch, that was my
>>> very first implementation! I was updating Redis with the throughput
>>> and then updating the file document once the upload completed. That
>>> worked very well but I wanted to remove Redis from the stack as the
>>> application is already pretty complex.
>>> 
>>> I'm guessing my best option is to revert back to that technique?
>> 
>> It depends on what your goals are. The initial design you mentioned
>> seems fine to me if you compact often. If you are optimising for
>> disk space, Redis or memcached may be a good idea. If you are
>> optimising for a small stack, not having Redis or memcached is a
>> good idea.
>> 
>>> As an aside, why would my document update handler be raising
>>> conflicts? My understanding was that update handlers would not raise
>>> conflicts - is that correct?
>> 
>> That is not correct.
>> 
>> Cheers
>> Jan
>> --
>> 
>>> 
>>> Thanks!
>>> 
>>> On Fri, Jun 3, 2011 at 3:03 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>> Hi,
>>>> 
>>>> On 3 Jun 2011, at 15:43, muji wrote:
>>>>> I'm still new to couchdb and nosql so apologies if the answer to this
>>>>> is trivial.
>>>> 
>>>> No worries, we're all new at something :)
>>>> 
>>>>> 
>>>>> I'm trying to track the throughput of a file sent via a POST request
>>>>> in a couchdb document.
>>>>> 
>>>>> My initial implementation creates a document for the file before the
>>>>> POST is sent and then I have an update handler that increments the
>>>>> "uploadbytes" for every chunk of data received from the client.
>>>> 
>>>> Could you make that a little less frequent and interpolate between the
>>>> data points? Instead of tracking bytes exactly at the chunk boundaries,
>>>> just update every 10 or so MB? And have the UI adjust accordingly?
>>>> 
>>>> 
>>>>> This *nearly* works except that I get document update conflicts (which
>>>>> I think is to do with me not being able to throttle back the upload
>>>>> while the db is updated) but the main problem is that for large files
>>>>> (~2.4GB) the number of document revisions is around 40-50,000. So I
>>>>> have a single document taking up between 0.7GB and 1GB. After
>>>>> compaction it reduces to ~380KB, which of course is much better, but
>>>>> this still seems excessive and poses problems with compacting a
>>>>> write-heavy database. I understand the trick to that is to replicate,
>>>>> compact and replicate back to the source, please correct me if I'm
>>>>> wrong...
>>>> 
>>>> Hm no that won't do anything, just regular compaction is good enough.
>>>> 
>>>>> So, I don't think this approach is viable which makes me wonder
>>>>> whether setting the _revs_limit will help, although I understand that
>>>>> setting this per database still requires compaction and will save on
>>>>> space after compaction.
>>>> 
>>>> _revs_limit won't help, you will always need to compact to get rid of
>>>> data.
>>>> 
>>>>> I was thinking that tracking the throughput as chunks in individual
>>>>> documents and then calculating the throughput with a map/reduce on all
>>>>> the chunks might be a better approach. Although I'm concerned that
>>>>> having lots of little documents for each data chunk will also take up
>>>>> large amounts of space...
>>>> 
>>>> Yeah, wouldn't save any space here. That said, the numbers you quote,
>>>> I wouldn't call "large amounts".
>>>> 
>>>> 
>>>>> Any advice and guidance on the best way to tackle this would be much
>>>>> appreciated.
>>>> 
>>>> I'd either set up continuous compaction (restart compaction right when
>>>> it is done) to keep DB size at a minimum or use an in-memory store
>>>> to keep track of the uploaded bytes.
>>>> 
>>>> Ideally though, CouchDB would give you an endpoint to query that kind
>>>> of data.
>>>> 
>>>> Cheers
>>>> Jan
>>>> --
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> muji.
>> 
>> 
> 
> 
> 
> -- 
> muji.


Re: Tracking file throughput?

Posted by muji <mi...@freeformsystems.com>.
Thanks again for your help Jan.

Sorry, I thought that continuous compaction might be a feature I had
overlooked. I have no problems automating a compaction process, I
always envisaged needing to do that...

I think that I will revert to running far fewer updates on the couchdb
document and caching the throughput in Redis as disc space is more of
a priority than application complexity.

A few more (different) questions in the pipeline as I'm still learning couch ;)

On Fri, Jun 3, 2011 at 3:37 PM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On 3 Jun 2011, at 16:28, muji wrote:
>
>> Thanks very much for the help.
>>
>> I could of course reduce the amount of times the update is done but
>> the service plans to bill based on throughput so this is quite
>> critical from a billing perspective.
>
> You can still bill on throughput as you will know exactly how much
> data has been transferred in what amount of time, but reporting is
> going to be less granular, i.e. chunks of say 10MB and not 100KB or
> however big chunks are.
>
>> A quick search for continuous compaction didn't yield anything, and I
>> don't see anything here:
>>
>> http://wiki.apache.org/couchdb/Compaction
>>
>> Could you point me in the right direction please?
>
> I made it up and I explained how to do it. Pseudocode:
>
> while(`curl -X POST -H 'Content-Type: application/json' http://127.0.0.1:5984/db/_compact`);
>
>> Funny you mention about caching before updating couch, that was my
>> very first implementation! I was updating Redis with the throughput
>> and then updating the file document once the upload completed. That
>> worked very well but I wanted to remove Redis from the stack as the
>> application is already pretty complex.
>>
>> I'm guessing my best option is to revert back to that technique?
>
> It depends on what your goals are. The initial design you mentioned
> seems fine to me if you compact often. If you are optimising for
> disk space, Redis or memcached may be a good idea. If you are
> optimising for a small stack, not having Redis or memcached is a
> good idea.
>
>> As an aside, why would my document update handler be raising
>> conflicts? My understanding was that update handlers would not raise
>> conflicts - is that correct?
>
> That is not correct.
>
> Cheers
> Jan
> --
>
>>
>> Thanks!
>>
>> On Fri, Jun 3, 2011 at 3:03 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>> Hi,
>>>
>>> On 3 Jun 2011, at 15:43, muji wrote:
>>>> I'm still new to couchdb and nosql so apologies if the answer to this
>>>> is trivial.
>>>
>>> No worries, we're all new at something :)
>>>
>>>>
>>>> I'm trying to track the throughput of a file sent via a POST request
>>>> in a couchdb document.
>>>>
>>>> My initial implementation creates a document for the file before the
>>>> POST is sent and then I have an update handler that increments the
>>>> "uploadbytes" for every chunk of data received from the client.
>>>
>>> Could you make that a little less frequent and interpolate between the
>>> data points? Instead of tracking bytes exactly at the chunk boundaries,
>>> just update every 10 or so MB? And have the UI adjust accordingly?
>>>
>>>
>>>> This *nearly* works except that I get document update conflicts (which
>>>> I think is to do with me not being able to throttle back the upload
>>>> while the db is updated) but the main problem is that for large files
>>>> (~2.4GB) the number of document revisions is around 40-50,000. So I
>>>> have a single document taking up between 0.7GB and 1GB. After
>>>> compaction it reduces to ~380KB, which of course is much better, but
>>>> this still seems excessive and poses problems with compacting a
>>>> write-heavy database. I understand the trick to that is to replicate,
>>>> compact and replicate back to the source, please correct me if I'm
>>>> wrong...
>>>
>>> Hm no that won't do anything, just regular compaction is good enough.
>>>
>>>> So, I don't think this approach is viable which makes me wonder
>>>> whether setting the _revs_limit will help, although I understand that
>>>> setting this per database still requires compaction and will save on
>>>> space after compaction.
>>>
>>> _revs_limit won't help, you will always need to compact to get rid of
>>> data.
>>>
>>>> I was thinking that tracking the throughput as chunks in individual
>>>> documents and then calculating the throughput with a map/reduce on all
>>>> the chunks might be a better approach. Although I'm concerned that
>>>> having lots of little documents for each data chunk will also take up
>>>> large amounts of space...
>>>
>>> Yeah, wouldn't save any space here. That said, the numbers you quote,
>>> I wouldn't call "large amounts".
>>>
>>>
>>>> Any advice and guidance on the best way to tackle this would be much
>>>> appreciated.
>>>
>>> I'd either set up continuous compaction (restart compaction right when
>>> it is done) to keep DB size at a minimum or use an in-memory store
>>> to keep track of the uploaded bytes.
>>>
>>> Ideally though, CouchDB would give you an endpoint to query that kind
>>> of data.
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>
>>
>>
>>
>> --
>> muji.
>
>



-- 
muji.

Re: Tracking file throughput?

Posted by Jan Lehnardt <ja...@apache.org>.
On 3 Jun 2011, at 16:28, muji wrote:

> Thanks very much for the help.
> 
> I could of course reduce the amount of times the update is done but
> the service plans to bill based on throughput so this is quite
> critical from a billing perspective.

You can still bill on throughput as you will know exactly how much
data has been transferred in what amount of time, but reporting is
going to be less granular, i.e. chunks of say 10MB and not 100KB or
however big chunks are.

> A quick search for continuous compaction didn't yield anything, and I
> don't see anything here:
> 
> http://wiki.apache.org/couchdb/Compaction
> 
> Could you point me in the right direction please?

I made it up and I explained how to do it. Pseudocode:

while(`curl -X POST -H 'Content-Type: application/json' http://127.0.0.1:5984/db/_compact`);
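
Since the POST returns immediately and compaction runs in the background,
a fuller sketch of that loop would trigger compaction, then poll the
database info until compact_running goes false before starting the next
round. This assumes Node.js with only the built-in http module; host,
port and database name are placeholders:

const http = require('http');

// tiny promise wrapper around the built-in http client; CouchDB speaks JSON
function request(method, path) {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { host: '127.0.0.1', port: 5984, path, method,
        headers: { 'Content-Type': 'application/json' } },
      (res) => {
        let body = '';
        res.on('data', (d) => { body += d; });
        res.on('end', () => resolve(JSON.parse(body)));
      }
    );
    req.on('error', reject);
    req.end();
  });
}

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function compactForever(db) {
  for (;;) {
    await request('POST', `/${db}/_compact`);   // returns 202 right away
    do {
      await sleep(1000);                        // let background compaction run
    } while ((await request('GET', `/${db}`)).compact_running);
  }
}

compactForever('db').catch(console.error);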

> Funny you mention about caching before updating couch, that was my
> very first implementation! I was updating Redis with the throughput
> and then updating the file document once the upload completed. That
> worked very well but I wanted to remove Redis from the stack as the
> application is already pretty complex.
> 
> I'm guessing my best option is to revert back to that technique?

It depends on what your goals are. The initial design you mentioned
seems fine to me if you compact often. If you are optimising for
disk space, Redis or memcached may be a good idea. If you are
optimising for a small stack, not having Redis or memcached is a
good idea.

> As an aside, why would my document update handler be raising
> conflicts? My understanding was that update handlers would not raise
> conflicts - is that correct?

That is not correct.

Cheers
Jan
-- 

> 
> Thanks!
> 
> On Fri, Jun 3, 2011 at 3:03 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> Hi,
>> 
>> On 3 Jun 2011, at 15:43, muji wrote:
>>> I'm still new to couchdb and nosql so apologies if the answer to this
>>> is trivial.
>> 
>> No worries, we're all new at something :)
>> 
>>> 
>>> I'm trying to track the throughput of a file sent via a POST request
>>> in a couchdb document.
>>> 
>>> My initial implementation creates a document for the file before the
>>> POST is sent and then I have an update handler that increments the
>>> "uploadbytes" for every chunk of data received from the client.
>> 
>> Could you make that a little less frequent and interpolate between the
>> data points? Instead of tracking bytes exactly at the chunk boundaries,
>> just update every 10 or so MB? And have the UI adjust accordingly?
>> 
>> 
>>> This *nearly* works except that I get document update conflicts (which
>>> I think is to do with me not being able to throttle back the upload
>>> while the db is updated) but the main problem is that for large files
>>> (~2.4GB) the number of document revisions is around 40-50,000. So I
>>> have a single document taking up between 0.7GB and 1GB. After
>>> compaction it reduces to ~380KB, which of course is much better, but
>>> this still seems excessive and poses problems with compacting a
>>> write-heavy database. I understand the trick to that is to replicate,
>>> compact and replicate back to the source, please correct me if I'm
>>> wrong...
>> 
>> Hm no that won't do anything, just regular compaction is good enough.
>> 
>>> So, I don't think this approach is viable which makes me wonder
>>> whether setting the _revs_limit will help, although I understand that
>>> setting this per database still requires compaction and will save on
>>> space after compaction.
>> 
>> _revs_limit won't help, you will always need to compact to get rid of
>> data.
>> 
>>> I was thinking that tracking the throughput as chunks in individual
>>> documents and then calculating the throughput with a map/reduce on all
>>> the chunks might be a better approach. Although I'm concerned that
>>> having lots of little documents for each data chunk will also take up
>>> large amounts of space...
>> 
>> Yeah, wouldn't save any space here. That said, the numbers you quote,
>> I wouldn't call "large amounts".
>> 
>> 
>>> Any advice and guidance on the best way to tackle this would be much
>>> appreciated.
>> 
>> I'd either set up continuous compaction (restart compaction right when
>> it is done) to keep DB size at a minimum or use an in-memory store
>> to keep track of the uploaded bytes.
>> 
>> Ideally though, CouchDB would give you an endpoint to query that kind
>> of data.
>> 
>> Cheers
>> Jan
>> --
>> 
>> 
> 
> 
> 
> -- 
> muji.


Re: Tracking file throughput?

Posted by muji <mi...@freeformsystems.com>.
Thanks very much for the help.

I could of course reduce the amount of times the update is done but
the service plans to bill based on throughput so this is quite
critical from a billing perspective.

A quick search for continuous compaction didn't yield anything, and I
don't see anything here:

http://wiki.apache.org/couchdb/Compaction

Could you point me in the right direction please?

Funny you mention about caching before updating couch, that was my
very first implementation! I was updating Redis with the throughput
and then updating the file document once the upload completed. That
worked very well but I wanted to remove Redis from the stack as the
application is already pretty complex.

I'm guessing my best option is to revert back to that technique?
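
For reference, the Redis side of that technique can be as small as one
INCRBY per chunk plus a single CouchDB write at the end. A rough Node.js
sketch, assuming the node "redis" client (v4, promise API) and an
illustrative key name:

const { createClient } = require('redis');    // assumes the node "redis" v4 client

// count an upload's bytes in Redis while streaming; return the total at the end
async function trackUpload(fileId, chunkStream) {
  const client = createClient();
  await client.connect();
  const key = `upload:${fileId}:bytes`;
  for await (const chunk of chunkStream) {
    await client.incrBy(key, chunk.length);   // cheap counter update per chunk
  }
  const total = Number(await client.get(key));
  await client.quit();
  return total;   // written into the CouchDB file document once, after completion
}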

As an aside, why would my document update handler be raising
conflicts? My understanding was that update handlers would not raise
conflicts - is that correct?

Thanks!

On Fri, Jun 3, 2011 at 3:03 PM, Jan Lehnardt <ja...@apache.org> wrote:
> Hi,
>
> On 3 Jun 2011, at 15:43, muji wrote:
>> I'm still new to couchdb and nosql so apologies if the answer to this
>> is trivial.
>
> No worries, we're all new at something :)
>
>>
>> I'm trying to track the throughput of a file sent via a POST request
>> in a couchdb document.
>>
>> My initial implementation creates a document for the file before the
>> POST is sent and then I have an update handler that increments the
>> "uploadbytes" for every chunk of data received from the client.
>
> Could you make that a little less frequent and interpolate between the
> data points? Instead of tracking bytes exactly at the chunk boundaries,
> just update every 10 or so MB? And have the UI adjust accordingly?
>
>
>> This *nearly* works except that I get document update conflicts (which
>> I think is to do with me not being able to throttle back the upload
>> while the db is updated) but the main problem is that for large files
>> (~2.4GB) the number of document revisions is around 40-50,000. So I
>> have a single document taking up between 0.7GB and 1GB. After
>> compaction it reduces to ~380KB, which of course is much better, but
>> this still seems excessive and poses problems with compacting a
>> write-heavy database. I understand the trick to that is to replicate,
>> compact and replicate back to the source, please correct me if I'm
>> wrong...
>
> Hm no that won't do anything, just regular compaction is good enough.
>
>> So, I don't think this approach is viable which makes me wonder
>> whether setting the _revs_limit will help, although I understand that
>> setting this per database still requires compaction and will save on
>> space after compaction.
>
> _revs_limit won't help, you will always need to compact to get rid of
> data.
>
>> I was thinking that tracking the throughput as chunks in individual
>> documents and then calculating the throughput with a map/reduce on all
>> the chunks might be a better approach. Although I'm concerned that
>> having lots of little documents for each data chunk will also take up
>> large amounts of space...
>
> Yeah, wouldn't save any space here. That said, the numbers you quote,
> I wouldn't call "large amounts".
>
>
>> Any advice and guidance on the best way to tackle this would be much
>> appreciated.
>
> I'd either set up continuous compaction (restart compaction right when
> it is done) to keep DB size at a minimum or use an in-memory store
> to keep track of the uploaded bytes.
>
> Ideally though, CouchDB would give you an endpoint to query that kind
> of data.
>
> Cheers
> Jan
> --
>
>



-- 
muji.

Re: Tracking file throughput?

Posted by Jan Lehnardt <ja...@apache.org>.
Hi,

On 3 Jun 2011, at 15:43, muji wrote:
> I'm still new to couchdb and nosql so apologies if the answer to this
> is trivial.

No worries, we're all new at something :)

> 
> I'm trying to track the throughput of a file sent via a POST request
> in a couchdb document.
> 
> My initial implementation creates a document for the file before the
> POST is sent and then I have an update handler that increments the
> "uploadbytes" for every chunk of data received from the client.

Could you make that a little less frequent and interpolate between the
data points? Instead of tracking bytes exactly at the chunk boundaries,
just update every 10 or so MB? And have the UI adjust accordingly?
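
Read as code, that suggestion might look like the following, with the
10 MB threshold and the sendUpdate callback as placeholders for however
the document actually gets updated:

const THRESHOLD = 10 * 1024 * 1024;   // flush to CouchDB roughly every 10 MB

// accumulate chunk sizes in memory; only hit the update handler at the threshold
function makeTracker(sendUpdate /* async function taking a byte count */) {
  let pending = 0;
  return {
    async onChunk(chunk) {
      pending += chunk.length;
      if (pending >= THRESHOLD) {
        await sendUpdate(pending);   // one document revision per ~10 MB, not per chunk
        pending = 0;
      }
    },
    async onEnd() {
      if (pending > 0) {
        await sendUpdate(pending);   // flush the remainder so billing totals stay exact
        pending = 0;
      }
    }
  };
}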


> This *nearly* works except that I get document update conflicts (which
> I think is to do with me not being able to throttle back the upload
> while the db is updated) but the main problem is that for large files
> (~2.4GB) the number of document revisions is around 40-50,000. So I
> have a single document taking up between 0.7GB and 1GB. After
> compaction it reduces to ~380KB, which of course is much better, but
> this still seems excessive and poses problems with compacting a
> write-heavy database. I understand the trick to that is to replicate,
> compact and replicate back to the source, please correct me if I'm
> wrong...

Hm no that won't do anything, just regular compaction is good enough.

> So, I don't think this approach is viable which makes me wonder
> whether setting the _revs_limit will help, although I understand that
> setting this per database still requires compaction and will save on
> space after compaction.

_revs_limit won't help, you will always need to compact to get rid of
data.

> I was thinking that tracking the throughput as chunks in individual
> documents and then calculating the throughput with a map/reduce on all
> the chunks might be a better approach. Although I'm concerned that
> having lots of little documents for each data chunk will also take up
> large amounts of space...

Yeah, wouldn't save any space here. That said, the numbers you quote,
I wouldn't call "large amounts".


> Any advice and guidance on the best way to tackle this would be much
> appreciated.

I'd either set up continuous compaction (restart compaction right when
it is done) to keep DB size at a minimum or use an in-memory store
to keep track of the uploaded bytes.

Ideally though, CouchDB would give you an endpoint to query that kind
of data.

Cheers
Jan
--