Posted to user@nutch.apache.org by Roland <ro...@rvh-gmbh.de> on 2013/02/20 17:53:50 UTC

nutch with cassandra internal network usage

Hi list,

We're experimenting with Nutch 2.1 and Cassandra 1.2.1 (on different hosts).
Our Cassandra 'webpage' store currently takes about 31GB on disk. We add
URLs by 'injecting' them, about 100k-300k per cycle.
When starting a 'fetch' run, it now takes about an hour before the
queues are set up and the first page is fetched.
During this time we see about 180Mbit/s of network traffic from the
Cassandra host to the Nutch host (outgoing from Cassandra).
If I calculate the data transferred during this time (counting only
150Mbit/s):
150Mbit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
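
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic: the constant names are made up for illustration, and the comparison with the 31GB figure assumes the on-disk size is a fair proxy for the amount of data a full read would transfer (compression and protocol overhead are ignored).

```python
# Reproduce the back-of-the-envelope estimate from the message.
RATE_MBIT_PER_S = 150   # conservative rate used in the calculation
DURATION_S = 3600       # roughly one hour until the first page is fetched
STORE_SIZE_GIB = 31     # reported on-disk size of the 'webpage' store


def transferred_gib(rate_mbit_per_s: float, duration_s: float) -> float:
    """Convert a sustained rate in Mbit/s into GiB moved over duration_s."""
    bytes_per_s = rate_mbit_per_s * 1000 * 1000 / 8
    return bytes_per_s * duration_s / 1024 ** 3


total = transferred_gib(RATE_MBIT_PER_S, DURATION_S)
ratio = total / STORE_SIZE_GIB
print(f"{total:.1f} GiB transferred, about {ratio:.1f}x the on-disk store")
```

With these inputs the transfer comes out at roughly twice the store size.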

So why does Nutch load all data from the db, and not only the data
relevant to this fetch? And why does it happen twice (62GB is about
twice the 31GB store)?

Thanks,
Roland

Re: nutch with cassandra internal network usage

Posted by Roland <ro...@rvh-gmbh.de>.
Hi Alex,

the GeneratorJob seems to have a solution for that; otherwise it would
iterate over all records too, am I right?

--Roland

On 20.02.2013 20:42, alxsss@aim.com wrote:
> Hi,
>
> This is because the fetch mapper goes over all records and selects those that have the given batchId. Currently the mappers of the Nutch commands do not have filters.
> It would be interesting to know whether you can select records with a given batchId in Cassandra without iterating over all records.
>
>
> Alex.
>

Re: nutch with cassandra internal network usage

Posted by al...@aim.com.
Hi,

This is because the fetch mapper goes over all records and selects those that have the given batchId. Currently the mappers of the Nutch commands do not have filters.
It would be interesting to know whether you can select records with a given batchId in Cassandra without iterating over all records.


Alex.
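
To make the difference concrete, here is a toy Python sketch of the two access patterns: reading every row and filtering on the client (what the message describes), versus looking rows up through an index keyed by batchId. Everything in it is hypothetical: the in-memory dict, the field names and the index are illustrations only, not Nutch, Gora or Cassandra code. Whether Cassandra can actually serve such a lookup is exactly the open question above.

```python
from collections import defaultdict


def make_store(n_rows, batch_id, every):
    """Toy stand-in for the 'webpage' store: row key -> record.

    Field names are illustrative, not the real schema.
    """
    store = {}
    for i in range(n_rows):
        mark = batch_id if i % every == 0 else None
        store[f"url-{i}"] = {"batchId": mark, "content": "x" * 100}
    return store


def select_by_full_scan(store, batch_id):
    """Read every row and filter on the client side."""
    rows_read = 0
    selected = []
    for key, record in store.items():
        rows_read += 1
        if record["batchId"] == batch_id:
            selected.append(key)
    return selected, rows_read


def build_batch_index(store):
    """Hypothetical secondary index: batchId -> list of row keys."""
    index = defaultdict(list)
    for key, record in store.items():
        if record["batchId"] is not None:
            index[record["batchId"]].append(key)
    return index


def select_by_index(store, index, batch_id):
    """Read only the rows the index points at."""
    keys = index.get(batch_id, [])
    rows_read = 0
    for key in keys:
        _ = store[key]  # simulate fetching the row
        rows_read += 1
    return list(keys), rows_read


BATCH = "1361367698-1708119958"
store = make_store(n_rows=10_000, batch_id=BATCH, every=100)
index = build_batch_index(store)
scan_keys, scan_reads = select_by_full_scan(store, BATCH)
idx_keys, idx_reads = select_by_index(store, index, BATCH)
print(f"full scan: {len(scan_keys)} selected, {scan_reads} rows read")
print(f"index:     {len(idx_keys)} selected, {idx_reads} rows read")
```

Both functions select the same rows; the only difference is how many rows have to be read to find them. The index itself has to be built and maintained, which this sketch does not cost out.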


Re: nutch with cassandra internal network usage

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I am assuming that your generate.max.count property is set to the
default of -1? Have you tried configuring more, smaller batchIds (fetch
lists)?
I don't have an immediate answer as to why the FetcherJob overall is
taking this much time and this many resources.


-- 
*Lewis*

Re: nutch with cassandra internal network usage

Posted by Roland <ro...@rvh-gmbh.de>.
Hi Lewis,

the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40

It's configured to fetch & parse, but it makes no difference if it only 
fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1

--Roland


On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
> Hi Roland,
>
> You say you start a fetch run; does this mean the FetcherJob or
> GeneratorJob? What kind of settings do you run your Nutch server with?

Re: nutch with cassandra internal network usage

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Roland,

You say you start a fetch run; does this mean the FetcherJob or
GeneratorJob? What kind of settings do you run your Nutch server with?


-- 
*Lewis*