Posted to user@couchdb.apache.org by Julian Moritz <ma...@julianmoritz.de> on 2010/07/04 11:36:57 UTC

Why I think view generation should be done concurrently.

Hi,

a few days ago I tweeted a wish to have view generation done
concurrently. I'll tell you why (because @janl doesn't think so).

I've got some documents in the form of:

{
  "_id": "1",
  "_rev": "3-abc",
  "url": "http://www.abc.com",
  "hrefs": ["http://www.xyz.com",
            "http://www.nbc.com",
            ...]
}

As you can imagine, since I'm crawling the web, I've got plenty of
them, and thousands more every second. I've got a view; map.py is:

def fun(doc):
    # Compound key (hash(href), href): the hash scrambles the sort
    # order, the href itself keeps equal hashes apart.
    if "hrefs" in doc:
        for href in doc["hrefs"]:
            yield (hash(href), href), None

reduce.py is:

def fun(key, value, rereduce):
    return True

If you can't read Python code: it generates a large list of unique,
pseudo-randomly ordered URLs. I query this view quite often (to get
new URLs to crawl).
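
A minimal sketch of such a query (the database and design-document
names "crawler", "links" and "urls" here are made up; group=true is
what collapses duplicate keys):

# Poll the view for a batch of fresh URLs (Python 2, stdlib only).
import json
import urllib2

view = ("http://127.0.0.1:5984/crawler/_design/links/_view/urls"
        "?group=true&limit=100")
for row in json.load(urllib2.urlopen(view))["rows"]:
    h, href = row["key"]  # the (hash(href), href) compound key from map.py
    print href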

So what is my problem? My CouchDB process is at 100% CPU and the view
sometimes takes quite long to build (even though I only have about
5-10 GB of test data). I've got 4 cores and 3 of them are sleeping. I
think it could be way faster if every core were used. What does
CouchDB do on a very large system, say 64 Atom cores (most of them
idling in energy-saving mode) and 20 TB of data? Use 1 core at, say,
1 GHz to munch through 20 TB? Oh please.

Why doesn't CouchDB use all cores to generate views?

Regards
Julian

P.S.: Maybe I'm totally wrong and the way you do it is right, but at
the moment it makes me mad to see one core out of four working while
the rest sit idle.


Re: Why I think view generation should be done concurrently.

Posted by Sebastian Cohnen <se...@googlemail.com>.
AFAIK the current architecture does not play well with this approach. Even if you have multiple concurrent view servers (and very fast storage), the view itself needs to be written to one file and one b-tree. So you could process the mapping faster, but the new bottleneck would be the process which writes the view back to disk.

(correct me if I'm wrong here).
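
A toy sketch of the shape of the problem (none of this is CouchDB's
actual implementation): the map step parallelizes trivially, but
everything still funnels back through one writer.

import multiprocessing

def map_doc(doc):
    # CPU-bound map step for a single document.
    return [((hash(href), href), None) for href in doc.get("hrefs", [])]

if __name__ == "__main__":
    docs = [{"hrefs": ["http://example.com/%d" % i]} for i in range(10000)]
    pool = multiprocessing.Pool()    # the map can use every core...
    mapped = pool.map(map_doc, docs)
    index = {}
    for rows in mapped:              # ...but merging into one sorted
        for key, value in rows:      # structure (one file, one b-tree)
            index[key] = value       # stays a single serial step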

On 04.07.2010, at 11:36, Julian Moritz wrote:

> [...]


Re: Why I think view generation should be done concurrently.

Posted by Julian Moritz <ma...@julianmoritz.de>.
Hi,

On Sunday, 04.07.2010, at 10:28 -0700, J Chris Anderson wrote:
> [...]
> 
> One thing you can do that is kinda neat: in the map, emit(fetched_url, 1) for each URL that has been fetched, and emit(linked_url, 0) for any URL that is linked.
> 
> Then you can use _sum instead of _count, and you will know to fetch any URLs where the reduce value is 0, because they haven't been fetched yet.
> 

cooly-dooly, thank you a lot!

Regards
Julian




Re: Why I think view generation should be done concurrently.

Posted by J Chris Anderson <jc...@gmail.com>.
On Jul 4, 2010, at 10:24 AM, Julian Moritz wrote:

> Hi,
> 
> [...]
> 
>> Also, if what you are really saying is that you only want each URL in your database once, it might make sense to use URLs (or URL hashes) as your docids, to prevent duplicates.
>> 
> 
> Nope. I'm yielding the _outgoing_ URLs of each URL. Having one document
> per URL is another topic (and I do that already).
> 

One thing you can do that is kinda neat: in the map, emit(fetched_url, 1) for each URL that has been fetched, and emit(linked_url, 0) for any URL that is linked.

Then you can use _sum instead of _count, and you will know to fetch any URLs where the reduce value is 0, because they haven't been fetched yet.
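
A sketch of such a map function, assuming documents shaped like the
example at the top of the thread (the reduce would simply be the
builtin "_sum"):

def fun(doc):
    if "url" in doc:
        yield doc["url"], 1           # this page has been fetched
    for href in doc.get("hrefs", []):
        yield href, 0                 # this URL is merely linked to

# Queried with group=true, any key whose summed value is 0 appears in
# some document's hrefs but has never been crawled.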



Re: Why I think view generation should be done concurrently.

Posted by Julian Moritz <ma...@julianmoritz.de>.
Hi,

On Sunday, 04.07.2010, at 09:37 -0700, J Chris Anderson wrote:
> [...]
> 
> If you just need unique keys, you can replace the text of the Python reduce function with "_count"; you will avoid the Python overhead for reduce, which will help a lot.
> 

ok, thanks.

> Also, if what you are really saying is that you only want each URL in your database once, it might make sense to use URLs (or URL hashes) as your docids, to prevent duplicates.
> 

Nope. I'm yielding the _outgoing_ URLs of each URL. Having one document
per URL is another topic (and I do that already).

Regards
Julian




Re: Why I think view generation should be done concurrently.

Posted by J Chris Anderson <jc...@gmail.com>.
On Jul 4, 2010, at 9:21 AM, Julian Moritz wrote:

> On Sunday, 04.07.2010, at 07:10 -0700, J Chris Anderson wrote:
> 
>>> reduce.py is:
>>> 
>>> def fun(key, value, rereduce):
>>>   return True
>>> 
>> 
>> You should remove this reduce function. It's not doing you any good and it's burning up your CPU. Things will be much faster without it.
>> 
> 
> But does the view then still do what I want? I need the keys to be
> unique.
> 

If you just need unique keys, you can replace the text of the Python reduce function with "_count"; you will avoid the Python overhead for reduce, which will help a lot.
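
A sketch of what that looks like in the design document (the view name
"urls" is made up; "_count" is the literal string CouchDB recognizes as
a builtin reduce):

{
  "language": "python",
  "views": {
    "urls": {
      "map": "def fun(doc):\n    for href in doc.get(\"hrefs\", []):\n        yield (hash(href), href), None",
      "reduce": "_count"
    }
  }
}

Queried with group=true, each distinct key then comes back exactly
once, with its count as the value.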

Also, if what you are really saying is that you only want each URL in your database once, it might make sense to use URLs (or URL hashes) as your docids, to prevent duplicates.




Re: Why I think view generation should be done concurrently.

Posted by Julian Moritz <ma...@julianmoritz.de>.
On Sunday, 04.07.2010, at 07:10 -0700, J Chris Anderson wrote:

> > reduce.py is:
> > 
> > def fun(key, value, rereduce):
> >    return True
> > 
> 
> You should remove this reduce function. It's not doing you any good and it's burning up your CPU. Things will be much faster without it.
> 

But does the view then still do what I want? I need the keys to be
unique.

Regards
Julian




Re: Why I think view generation should be done concurrently.

Posted by J Chris Anderson <jc...@gmail.com>.
On Jul 4, 2010, at 2:36 AM, Julian Moritz wrote:

> [...]
> 
> reduce.py is:
> 
> def fun(key, value, rereduce):
>    return True
> 

You should remove this reduce function. It's not doing you any good and it's burning up your CPU. Things will be much faster without it.

Chris
