Posted to user@nutch.apache.org by Jason Camp <jc...@vhosting.com> on 2006/04/10 00:46:52 UTC

Question about crawldb and segments

Hi,
    I've been using Nutch 7 for a few months, and recently started 
working with 8.  I'm testing everything right now on a single server, 
using the local file system.  I generated 10 segments with 100k urls in 
each, and fetched the content. Then I ran updatedb, but it looks like 
the crawldb isn't being updated properly. For example, I ran the updatedb 
command on one segment, and -stats shows this:

060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447
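
(By -stats I mean the crawldb reader; with my layout the full command is 
roughly:

    bin/nutch readdb crawl/crawldb -stats

where crawl/crawldb is just where my db happens to live.)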

I then ran the updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377


Any idea why the number of fetched urls would actually go down? What I 
*think* is happening is that the crawldb only contains the data from the 
most recent updatedb, not the accumulated results of all of them. Does 
that make sense? I repeated the test for each segment, running updatedb 
and then -stats, and every time the counts are around 80k fetched and 
1.7m unfetched, but the numbers don't seem to be accumulating.

Since readsegs is broken in 8, I can't really get an idea of what is 
actually in the segments. Is there an alternative way to see how many 
urls are actually in the segment and fetched?

If you have any ideas, please let me know. Thanks a lot!

Jason


Re: Question about crawldb and segments

Posted by Jason Camp <jc...@vhosting.com>.
Hey Doug,
    I'm finally picking this up again, and I believe I have everything 
configured as you suggested - I'm running dfs at the indexing 
datacenter, along with a job tracker/namenode/task trackers. I'm running 
a job tracker/namenode/task trackers at the crawling datacenter, and it's 
all pointed to use dfs at the indexing datacenter.

If I have mapreduce set to run locally, everything seems to be working 
fine. But if I set mapred.job.tracker to the job tracker that's running 
in the crawl datacenter, I get this when trying to run a fetch:

060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm specifying the fs.default.name as my 
other datacenter, it's trying to make itself a dfs node also. I was 
experimenting with some of the shell scripts, and it seems that in any 
configuration where this server is a task tracker, it also tries to be a 
dfs node. Am I missing something, or is there no way to make this server 
only a task tracker and not a dfs node also?
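
(If I'm reading the scripts right, start-all.sh walks conf/slaves and 
starts both a datanode and a task tracker on every host listed there, so 
what I'm really asking is whether it's safe to skip that on the crawl 
machines and start just the one daemon by hand, e.g. something like

    bin/hadoop-daemon.sh start tasktracker

assuming hadoop-daemon.sh is the right hook for that.)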

Let me know if this makes sense; it's even a little confusing for me, 
and I set it up :)

Thanks!

Jason


Doug Cutting wrote:

> Jason Camp wrote:
>
>> Unfortunately in our scenario, bw is cheap at our fetching datacenter,
>> but adding additional disk capacity is expensive - so we are fetching
>> the data and sending it back to another cluster (by exporting segments
>> from ndfs, copy, importing).
>
>
> But to perform the copies, you're using a lot of bandwidth to your 
> "indexing" datacenter, no?  Copying segments probably takes almost as 
> much bandwidth as fetching them...
>
>>     I know this sounds a bit messy, but it was the only way we could
>> come up with to utilize the benefits of both datacenters. Ideally, I'd
>> love to be able to have all of the servers in one cluster, and define
>> which servers I want to perform which tasks, so for instance we could
>> use the one group of servers to fetch the data, but the other group of
>> servers to store the data and perform the indexing/etc. If there's a
>> better way to do something like this than what we're doing, or if  you
>> think we're just insane for doing it this way, please let me know :) 
>> Thanks!
>
>
> You can use different sets of machines for dfs and MapReduce, by 
> starting them in differently configured installations.  So you could 
> run dfs only in your "indexing" datacenter, and MapReduce in both 
> datacenters configured to talk to the same dfs, at the "indexing" 
> datacenter.  Then your fetch tasks at the "fetch" datacenter would 
> write their output to the "indexing" datacenter's dfs.  And 
> parse/updatedb/generate/index/etc. could all run at the other 
> datacenter.  Does that make sense?
>
> Doug



Re: Question about crawldb and segments

Posted by Doug Cutting <cu...@apache.org>.
Jason Camp wrote:
> Unfortunately in our scenario, bw is cheap at our fetching datacenter,
> but adding additional disk capacity is expensive - so we are fetching
> the data and sending it back to another cluster (by exporting segments
> from ndfs, copy, importing).

But to perform the copies, you're using a lot of bandwidth to your 
"indexing" datacenter, no?  Copying segments probably takes almost as 
much bandwidth as fetching them...

>     I know this sounds a bit messy, but it was the only way we could
> come up with to utilize the benefits of both datacenters. Ideally, I'd
> love to be able to have all of the servers in one cluster, and define
> which servers I want to perform which tasks, so for instance we could
> use the one group of servers to fetch the data, but the other group of
> servers to store the data and perform the indexing/etc. If there's a
> better way to do something like this than what we're doing, or if  you
> think we're just insane for doing it this way, please let me know :) Thanks!

You can use different sets of machines for dfs and MapReduce, by 
starting them in differently configured installations.  So you could run 
dfs only in your "indexing" datacenter, and MapReduce in both 
datacenters configured to talk to the same dfs, at the "indexing" 
datacenter.  Then your fetch tasks at the "fetch" datacenter would write 
their output to the "indexing" datacenter's dfs.  And 
parse/updatedb/generate/index/etc. could all run at the other 
datacenter.  Does that make sense?
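
(As a rough illustration of that split - hostnames and ports below are 
placeholders - the hadoop-site.xml used by the "fetch" datacenter's 
installation would point fs.default.name at the remote namenode and 
mapred.job.tracker at its own local jobtracker:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>namenode.indexing.example.com:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.crawl.example.com:9001</value>
      </property>
    </configuration>

while only the "indexing" datacenter's installation starts dfs daemons.)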

Doug

Re: Question about crawldb and segments

Posted by Jason Camp <jc...@vhosting.com>.
    There's one problem with this in my environment - I'm basically
running two datacenters with different uses. Hopefully this won't sound
too confusing :)  We're using one datacenter to do the fetching, with a
hadoop cluster of servers, and one datacenter to store content and make
available for searching, with another cluster of hadoop servers.  My
understanding is that when you make servers into slaves for a cluster,
there's no way to define specifically what they do - i.e., these servers
just do fetching, these servers just do indexing. Is that correct?
Unfortunately in our scenario, bw is cheap at our fetching datacenter,
but adding additional disk capacity is expensive - so we are fetching
the data and sending it back to another cluster (by exporting segments
from ndfs, copy, importing).
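
(Concretely that export/copy/import is roughly the following - hostnames 
and paths are placeholders, and it assumes the hadoop dfs shell's 
-get/-put:

    # on the crawl cluster
    bin/hadoop dfs -get crawl/segments/<segment> /tmp/<segment>
    # ship it to the indexing datacenter
    scp -r /tmp/<segment> indexer:/tmp/
    # on the indexing cluster
    bin/hadoop dfs -put /tmp/<segment> crawl/segments/<segment>

with <segment> standing in for the actual segment directory name.)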
    I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks, so for instance we could
use the one group of servers to fetch the data, but the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if  you
think we're just insane for doing it this way, please let me know :) Thanks!

Jason


Doug Cutting wrote:

> Jason Camp wrote:
>
>> I'd like to generate multiple segments in a row, and send them off to
>> another server, is this possible using the local file system?
>
>
> The Hadoop-based Nutch now automates multiple, parallel fetches for
> you.  So there is less need to manually generate multiple segments. 
> Try configuring your servers as slaves (by adding them to conf/slaves)
> and configuring a master (by setting fs.default.name and
> mapred.job.tracker in conf/hadoop-site.xml) then using bin/start-all.sh
> to start daemons. Then copy your root url directory to dfs with
> something like 'bin/hadoop dfs -put roots roots'.  Then you can run a
> multi-machine crawl with 'bin/nutch crawl'.  Or if you need
> finer-grained control, you can still step through the inject,
> generate, fetch, updatedb, generate, fetch, ... cycle, except now each
> step runs across all slave nodes.
>
> This is outlined in the Hadoop javadoc:
>
> http://lucene.apache.org/hadoop/docs/api/overview-summary.html#overview_description
>
>
> Doug



Re: Question about crawldb and segments

Posted by Doug Cutting <cu...@apache.org>.
Jason Camp wrote:
> I'd like to generate multiple 
> segments in a row, and send them off to another server, is this possible 
> using the local file system?

The Hadoop-based Nutch now automates multiple, parallel fetches for you. 
  So there is less need to manually generate multiple segments.  Try 
configuring your servers as slaves (by adding them to conf/slaves) and 
configuring a master (by setting fs.default.name and mapred.job.tracker 
in conf/hadoop-site.xml) then using bin/start-all.sh to start daemons. 
Then copy your root url directory to dfs with something like 'bin/hadoop 
dfs -put roots roots'.  Then you can run a multi-machine crawl with 
'bin/nutch crawl'.  Or if you need finer-grained control, you can still 
step through the inject, generate, fetch, updatedb, generate, fetch, ... 
cycle, except now each step runs across all slave nodes.
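
(Roughly, with placeholder paths and numbers, that is:

    bin/start-all.sh
    bin/hadoop dfs -put roots roots
    bin/nutch crawl roots -dir crawl -depth 3 -topN 100000

or, stepping through it yourself:

    bin/nutch inject crawl/crawldb roots
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    bin/nutch fetch crawl/segments/<segment>
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

repeating generate/fetch/updatedb for as many rounds as you want.)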

This is outlined in the Hadoop javadoc:

http://lucene.apache.org/hadoop/docs/api/overview-summary.html#overview_description

Doug

Re: Question about crawldb and segments

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jason Camp wrote:
> Doh, I think I found out the problem. After using luke to dig through 
> the indexed segments, it looks like all of the segments that I 
> generated contain the same exact urls.  When you generate a segment 
> with the top 100k urls, I'm guessing they are not marked in any way to 
> prevent the next generate from grabbing the same urls? I'd like to 
> generate multiple segments in a row, and send them off to another 
> server, is this possible using the local file system?

No, at the moment they are not marked in any way. This is on my TODO 
list, but not with a high priority.
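
(The usual way around it for now, sketched with placeholder paths, is to 
let fetch and updatedb run before the next generate, so the freshly 
fetched urls are no longer due when the next segment is selected:

    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 100000
      s=`ls -d crawl/segments/2* | tail -1`   # newest segment dir
      bin/nutch fetch $s
      bin/nutch updatedb crawl/crawldb $s
    done

but there is nothing yet to make back-to-back generates pick different 
urls.)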

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Question about crawldb and segments

Posted by Jason Camp <jc...@vhosting.com>.
Doh, I think I found the problem. After using Luke to dig through the 
indexed segments, it looks like all of the segments I generated contain 
exactly the same urls.  When you generate a segment with the top 100k 
urls, I'm guessing they are not marked in any way to prevent the next 
generate from grabbing the same urls? I'd like to generate multiple 
segments in a row and send them off to another server - is this possible 
using the local file system?


Jason


Jason Camp wrote:

> Hi,
>    I've been using Nutch 7 for a few months, and recently started 
> working with 8.  I'm testing everything right now on a single server, 
> using the local file system.  I generated 10 segments with 100k urls 
> in each, and fetched the content. Then I do the updatedb, but it looks 
> like the crawldb isn't working properly. For example, I ran the 
> updatedb command on one segment, and -stats shows this:
>
> 060409 140035 status 1 (DB_unfetched):  1732457
> 060409 140035 status 2 (DB_fetched):    82608
> 060409 140035 status 3 (DB_gone):       3447
>
> I then ran the updatedb against the next segment, and -stats now shows 
> this:
>
> 060409 150737 status 1 (DB_unfetched):  1777642
> 060409 150737 status 2 (DB_fetched):    81629
> 060409 150737 status 3 (DB_gone):       3377
>
>
> Any idea why the number of fetched urls would actually go down? What I 
> *think* is happening is that the crawldb only contains the data from 
> the last crawl, not the subsequent crawls. Does this make sense? I ran 
> the test doing each segment and running -stats, and they are all 
> around 80k for fetched and 1.7m for unfetched, but the numbers don't 
> seem to be accumulating.
>
> Since readsegs is broken in 8, I can't really get an idea of what is 
> actually in the segments. Is there an alternative way to see how many 
> urls are actually in the segment and fetched?
>
> If you have any ideas, please let me know. Thanks a lot!
>
> Jason
>