Posted to user@nutch.apache.org by Roland <ro...@rvh-gmbh.de> on 2013/02/20 17:53:50 UTC
nutch with cassandra internal network usage
Hi list,
we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
Our cassandra 'webpage' store has about 31GB right now on disk, we add
URLs by 'injecting' them, about 100k-300k per cycle.
When starting a 'fetch' run, it now needs about an hour before the
queues are set up / the first page is fetched.
During this time we can see about 180MBit/s network traffic from the
cassandra host to the nutch host (outgoing of cassandra).
If I calculate the transferred data during this time (taking only
150Mbit/s into account):
150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
So, why does nutch load all data from the db, and not only the relevant
data of this fetch? And why does it happen twice?
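
The estimate above can be checked with a short script. This is only a sketch of the arithmetic in the post: the rate, duration and store size are the figures quoted there, while the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the transfer estimate from the post.
RATE_MBIT_PER_S = 150   # conservative rate used in the post (observed: ~180)
DURATION_S = 3600       # roughly one hour until the first page is fetched
STORE_SIZE_GB = 31      # on-disk size of the 'webpage' store

def transferred_gib(rate_mbit_per_s, duration_s):
    """Convert a sustained rate in Mbit/s into GiB moved over duration_s seconds."""
    bytes_per_s = rate_mbit_per_s * 1000 * 1000 / 8
    return bytes_per_s * duration_s / 1024**3

total = transferred_gib(RATE_MBIT_PER_S, DURATION_S)
ratio = total / STORE_SIZE_GB
print(f"{total:.1f} GiB transferred, about {ratio:.1f}x the on-disk store size")
```

The result is close to twice the 31GB store, which is where the "twice" in the question comes from. On-disk size and bytes on the wire are not directly comparable (compression, serialization overhead), so the ratio is only a rough indication.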
Thanks,
Roland
Re: nutch with cassandra internal network usage
Posted by Roland <ro...@rvh-gmbh.de>.
Hi Alex,
the GeneratorJob seems to have a solution for that; if not, it would
iterate over all records too, am I right?
--Roland
On 20.02.2013 20:42, alxsss@aim.com wrote:
> Hi,
>
> This is because fetch's mapper goes over all records and selects those that have the given batchId. Currently, the mappers of all nutch commands do not have filters.
> It would be interesting to know if you can select records with a given batchId in cassandra without iterating over all records.
>
>
> Alex.
>
Re: nutch with cassandra internal network usage
Posted by al...@aim.com.
Hi,
This is because fetch's mapper goes over all records and selects those that have the given batchId. Currently, the mappers of all nutch commands do not have filters.
It would be interesting to know if you can select records with a given batchId in cassandra without iterating over all records.
Alex.
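
To illustrate the difference described above, here is a minimal sketch in plain Python. It is not Nutch, Gora or Cassandra code: the in-memory `rows` dict, the row keys and all function names are hypothetical stand-ins. It only contrasts reading every record and filtering on batchId with looking the batch up through a separately maintained index.

```python
from collections import defaultdict

BATCH = "1361367698-1708119958"  # batchId taken from the fetch command in this thread

# Hypothetical stand-in for the 'webpage' store: row key -> row.
rows = {
    "com.example:http/a": {"batchId": BATCH},
    "com.example:http/b": {"batchId": "some-other-batch"},
    "org.example:http/c": {"batchId": BATCH},
}

def select_by_scan(store, batch_id):
    """Read every row and keep the matching ones; cost grows with the store size."""
    rows_read = 0
    selected = []
    for key, row in store.items():
        rows_read += 1
        if row.get("batchId") == batch_id:
            selected.append(key)
    return selected, rows_read

def build_batch_index(store):
    """Build a batchId -> row keys index (would be maintained at generate time)."""
    index = defaultdict(list)
    for key, row in store.items():
        index[row.get("batchId")].append(key)
    return index

def select_by_index(index, batch_id):
    """Touch only the rows that belong to the batch."""
    keys = list(index.get(batch_id, []))
    return keys, len(keys)

scan_keys, scan_reads = select_by_scan(rows, BATCH)
index_keys, index_reads = select_by_index(build_batch_index(rows), BATCH)
print(scan_keys == index_keys, scan_reads, index_reads)
```

Both approaches return the same keys, but the scan reads all rows while the index lookup reads only the batch members. Whether the store can offer such a lookup without a full iteration is exactly the open question raised above.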
-----Original Message-----
From: Roland <ro...@rvh-gmbh.de>
To: user <us...@nutch.apache.org>
Sent: Wed, Feb 20, 2013 10:56 am
Subject: Re: nutch with cassandra internal network usage
Hi Lewis,
the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40
It's configured to fetch & parse, but it makes no difference if it only
fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
--Roland
On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
> Hi Roland,
>
> You say you start a fetch run, does this mean the FetcherJob or
> GeneratorJob? What kind of settings do you run your Nutch server with?
Re: nutch with cassandra internal network usage
Posted by Lewis John Mcgibbney <le...@gmail.com>.
I am assuming that your generate.max.count property value is set to the
default -1? Have you tried configuring more, smaller batchIds (fetch
lists)?
I don't have an immediate answer as to why, overall, the FetcherJob is
taking this amount of time and resources.
On Wednesday, February 20, 2013, Roland <ro...@rvh-gmbh.de> wrote:
> Hi Lewis,
>
> the GeneratorJob takes only ~5 minutes.
> I'm running it in standalone mode, like this:
> ./bin/nutch fetch 1361367698-1708119958 -threads 40
>
> It's configured to fetch & parse, but it makes no difference if it only
> fetches:
> FetcherJob: starting
> FetcherJob: batchId: 1361367698-1708119958
> FetcherJob: threads: 40
> FetcherJob: parsing: true
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : -1
>
> --Roland
>
>
> On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
>>
>> Hi Roland,
>>
>> You say you start a fetch run, does this mean the FetcherJob or
>> GeneratorJob? What kind of settings do you run your Nutch server with?
>
--
*Lewis*
Re: nutch with cassandra internal network usage
Posted by Roland <ro...@rvh-gmbh.de>.
Hi Lewis,
the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40
It's configured to fetch & parse, but it makes no difference if it only
fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
--Roland
On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
> Hi Roland,
>
> You say you start a fetch run, does this mean the FetcherJob or
> GeneratorJob? What kind of settings do you run your Nutch server with?
Re: nutch with cassandra internal network usage
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Roland,
You say you start a fetch run, does this mean the FetcherJob or
GeneratorJob? What kind of settings do you run your Nutch server with?
On Wednesday, February 20, 2013, Roland <ro...@rvh-gmbh.de> wrote:
> Hi list,
>
> we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
> Our cassandra 'webpage' store has about 31GB right now on disk, we add
> URLs by 'injecting' them, about 100k-300k per cycle.
> When starting a 'fetch' run, it now needs about an hour before the queues
> are set up / the first page is fetched.
> During this time we can see about 180MBit/s network traffic from the
> cassandra host to the nutch host (outgoing of cassandra).
> If I calculate the transferred data during this time (taking only
> 150Mbit/s into account):
> 150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
>
> So, why does nutch load all data from the db, and not only the relevant
> data of this fetch? And why does it happen twice?
>
> Thanks,
> Roland
>
--
*Lewis*