Posted to user@nutch.apache.org by kevin chen <ke...@bdsing.com> on 2009/10/25 03:36:47 UTC

Missing pages from Index in NUTCH 1.0

Hi,

I had established a way to crawl in Nutch 0.9, but it no longer works
in Nutch 1.0. I hope someone can shed some light on this problem.

This is what I do. I have grouped my set of URLs into a few groups and
crawl them separately, so I can crawl them with different depths,
filters, and schedules. Some groups of URLs are all from the same site.
After I am done with all groups, I copy all the segments together, do a
crawldb update (which creates a new crawldb), and then index.
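
Roughly, the commands look like this (directory and group names are
only examples, and my real scripts differ in the details):

  # one crawl per group, each with its own depth/filters/schedule
  bin/nutch crawl urls/group1 -dir crawl-group1 -depth 3
  bin/nutch crawl urls/group2 -dir crawl-group2 -depth 5
  # copy all segments into one place
  mkdir -p crawl-all/segments
  cp -r crawl-group*/segments/* crawl-all/segments/
  # rebuild the crawldb from the combined segments, then index
  bin/nutch updatedb crawl-all/crawldb -dir crawl-all/segments
  bin/nutch invertlinks crawl-all/linkdb -dir crawl-all/segments
  bin/nutch index crawl-all/indexes crawl-all/crawldb crawl-all/linkdb \
      crawl-all/segments/*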

This scheme worked well with Nutch 0.9. But since I switched to Nutch
1.0, the search results miss the URLs of certain segments altogether. I
have made sure that I am not filtering them out in any of the steps
(crawldb update and index).

Am I doing this totally wrong, and it only worked in 0.9 by luck? Or
has something changed in 1.0?

Thanks


Re: Missing pages from Index in NUTCH 1.0

Posted by reinhard schwab <re...@aon.at>.
Paul Tomblin sent a patch on 14.10.2009. Filtering out not-modified
pages makes sense to me if the index is built incrementally and those
pages are already in the index being updated. Lucene offers the option
to update an existing index, but in my case I always build a new one.
You may want to review the code in IndexerMapReduce.java; I have not
done so until now.

Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(revision 817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(working copy)
@@ -84,8 +84,10 @@
         if (CrawlDatum.hasDbStatus(datum))
           dbDatum = datum;
         else if (CrawlDatum.hasFetchStatus(datum)) {
-          // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+          /*
+           * Where did this person get the idea that unmodified pages are empty?
+           // don't index unmodified (empty) pages
+          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
             fetchDatum = datum;
         } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                    CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }

     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }
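
If you want to try it, something like this should apply the diff above
(assuming you saved it as IndexerMapReduce.patch) and rebuild; the
paths in the diff are relative to the top of the Nutch 1.0 source tree:

  cd nutch-1.0                         # your nutch source root
  patch -p0 < IndexerMapReduce.patch
  ant                                  # rebuild nutch
  # then re-run only the index step over the existing data, e.g.
  bin/nutch index crawl-all/indexes crawl-all/crawldb crawl-all/linkdb \
      crawl-all/segments/*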


kevin chen schrieb:
> BTW, I forgot to mention that I have checked the segment by dumping
> its content, and the content is there. I also checked the crawldb;
> the URLs are in the db with a status of db_fetched.


Re: Missing pages from Index in NUTCH 1.0

Posted by kevin chen <ke...@bdsing.com>.
BTW, I forgot to mention that I have checked the segment by dumping its
content, and the content is there. I also checked the crawldb; the URLs
are in the db with a status of db_fetched.
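
For reference, the checks looked roughly like this (the segment name
and URL below are just examples):

  # dump the segment to a local directory and inspect the content
  bin/nutch readseg -dump crawl-all/segments/20091024223600 segdump
  # look up one of the missing urls in the crawldb
  bin/nutch readdb crawl-all/crawldb -url http://www.example.com/page.html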
