Posted to dev@nutch.apache.org by "Andy Liu (JIRA)" <ji...@apache.org> on 2005/04/12 23:59:17 UTC

[jira] Updated: (NUTCH-5) Hit limiter off-by-one bug

     [ http://issues.apache.org/jira/browse/NUTCH-5?page=history ]

Andy Liu updated NUTCH-5:
-------------------------

    Attachment: fix-hitlimiting.patch

Patches NutchBean to fix hit limiting off-by-one issue.

> Hit limiter off-by-one bug
> --------------------------
>
>          Key: NUTCH-5
>          URL: http://issues.apache.org/jira/browse/NUTCH-5
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: fix-hitlimiting.patch
>
> When re-searching for more raw hits, the first result of the next site is skipped.
> From NutchBean.java
> *snip*
> // get the next raw hit
>             if (rawHitNum >= hits.getLength()) {
>                 // optimize query by prohibiting more matches on some excluded sites
>                 Query optQuery = (Query) query.clone();
>                 for (int i = 0; i < excludedSites.size(); i++) {
>                     if (i == MAX_PROHIBITED_TERMS) {
>                         break;
>                     }
>                     optQuery.addProhibitedTerm(((String) excludedSites.get(i)),
>                         IndexSearcher.HIT_LIMIT_FIELD);
>                 }
>                 numHitsRaw = (int) (numHitsRaw * RAW_HITS_FACTOR);
>                 LOG.info("re-searching for " + numHitsRaw +
>                     " raw hits, query: " + optQuery);
>                 hits = searcher.search(optQuery, numHitsRaw);
>                 LOG.info("found " + hits.getTotal() + " raw hits");
>                 rawHitNum = 0;
>                 continue;
>             }
> *snip*
> rawHitNum is reset to 0, but the 'continue' still runs the for loop's increment, so the loop resumes at index 1 and skips the first result of the new hits.
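
The effect described above can be reproduced outside Nutch. The following is a minimal sketch, not the contents of fix-hitlimiting.patch: the class name, method name, and batch sizes are hypothetical, and the loop only imitates the shape of the snippet (a for loop whose body resets the counter and then calls continue). Resetting to -1 is one possible fix, shown here as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class HitLimiterDemo {

    /**
     * Imitates the loop shape from the report and returns the indexes that
     * are read from the re-searched batch. resetValue is what the counter is
     * set to right before 'continue'.
     */
    static List<Integer> indexesReadAfterResearch(int resetValue) {
        int length = 2;                 // size of the first raw-hit batch (hypothetical)
        boolean researched = false;
        List<Integer> read = new ArrayList<>();
        for (int rawHitNum = 0; ; rawHitNum++) {
            if (rawHitNum >= length) {
                if (researched) {
                    break;              // stop after one re-search in this demo
                }
                researched = true;
                length = 4;             // the re-search returns a larger batch (hypothetical)
                rawHitNum = resetValue;
                continue;               // the update clause rawHitNum++ still runs
            }
            if (researched) {
                read.add(rawHitNum);    // record which new hits are actually visited
            }
        }
        return read;
    }

    public static void main(String[] args) {
        System.out.println("reset to 0:  " + indexesReadAfterResearch(0));
        System.out.println("reset to -1: " + indexesReadAfterResearch(-1));
    }
}
```

With a reset value of 0 the demo never reads index 0 of the new batch; with -1 the increment brings the counter back to 0 and every hit is read.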

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Massimo Miccoli <mm...@iltrovatore.it>.

Thanks Doug,
I will try. But is there no other way to detect a crawler loop? How can we 
discover similar cases among billions of pages?

Thanks
Massimo

Doug Cutting wrote:

> Massimo Miccoli wrote:
>
>> The general problem is URLs like: 
>> http://www.agriturismo.pg.it/storia-citta-umbria/index.html
>> a custom not-found page that generates an infinite crawler loop on the site.
>
>
> You're referring to error pages that do not return 404?
>
> In another thread I just suggested a way to handle these:
>
> http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html 
>
>
> The URL you mention is amenable to this solution.  Its title contains 
> the string "pagina di errore", but it does not return a 404.
>
> Doug
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: New Crystal Reports XI.
> Version 11 adds new functionality designed to reduce time involved in
> creating, integrating, and deploying reporting solutions. Free runtime 
> info,
> new features, or free trial, at: http://www.businessobjects.com/devxi/728
> _______________________________________________
> Nutch-developers mailing list
> Nutch-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
>

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Doug Cutting <cu...@nutch.org>.

Massimo Miccoli wrote:
> The general problem is URLs like: 
> http://www.agriturismo.pg.it/storia-citta-umbria/index.html
> a custom not-found page that generates an infinite crawler loop on the site.

You're referring to error pages that do not return 404?

In another thread I just suggested a way to handle these:

http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html

The URL you mention is amenable to this solution.  Its title contains 
the string "pagina di errore", but it does not return a 404.

Doug
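
The linked thread is not reproduced here, so the following is only a rough sketch of the idea mentioned in this message: treating a page as "not found" when its title contains a known error phrase, even though the server did not return a 404. The class name, method name, and phrase list are hypothetical and are not part of Nutch; only the phrase "pagina di errore" comes from the message.

```java
import java.util.List;
import java.util.Locale;

public class SoftNotFoundCheck {

    // Hypothetical phrase list; "pagina di errore" is the title string quoted in the message.
    static final List<String> ERROR_TITLE_PHRASES = List.of("pagina di errore");

    /** Returns true when the page should be treated as "not found". */
    static boolean looksNotFound(int httpStatus, String title) {
        if (httpStatus == 404) {
            return true;                // a real 404 needs no further checks
        }
        if (title == null) {
            return false;               // no title, nothing to match against
        }
        String normalized = title.toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_TITLE_PHRASES) {
            if (normalized.contains(phrase)) {
                return true;            // error page served with a non-404 status
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksNotFound(200, "Pagina di errore"));
        System.out.println(looksNotFound(200, "Storia delle citta"));
    }
}
```

A check like this depends on knowing the error phrases in advance, so it is a per-site workaround rather than a general loop detector.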

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Massimo Miccoli <mm...@iltrovatore.it>.

The general problem is URLs like: 
http://www.agriturismo.pg.it/storia-citta-umbria/index.html
a custom not-found page that generates an infinite crawler loop on the site.
It's not a rare case when one (like me) tries to fetch
the whole web.
BTW, if you want, you can test a Nutch search on 50,000,000 pages (not 
URLs) at http://crawlers.iltrovatore.it:8088/search.jsp

massimo



Doug Cutting wrote:

> Massimo Miccoli wrote:
>
>> In any case, circular links are a big problem for Nutch, not only 
>> for the analyze tool but also for fetcher speed and WebDB size.  Any 
>> solution?
>
>
> What is the general problem with Nutch's handling of circular links? 
> Nearly every site has them.  I am able to crawl, index and search 
> sites with circular links.
>
> Doug

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Doug Cutting <cu...@nutch.org>.

Massimo Miccoli wrote:
> In any case, circular links are a big problem for Nutch, not only for 
> the analyze tool but also for fetcher speed and WebDB size.  Any solution?

What is the general problem with Nutch's handling of circular links? 
Nearly every site has them.  I am able to crawl, index and search sites 
with circular links.

Doug

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Massimo Miccoli <mm...@iltrovatore.it>.

In any case, circular links are a big problem for Nutch, not only for 
the analyze tool but also for fetcher speed and WebDB size.  Any solution?

Thx Massimo

Doug Cutting wrote:

> yoursoft@freemail.hu wrote:
>
>> I have the same problem with the latest Nutch from svn. The reported 
>> solution doesn't work for me. I would like to fix this problem.
>> Does anyone have any idea how to start with it?
>
>
> Sorry, but I have not used the analyze tool in a while.  It is not 
> actively maintained.
>
> As a workaround, I recommend setting both 
> fetchlist.score.by.link.count and indexer.boost.by.link.count to 
> true.  This works well, provided you're not encountering a lot of link 
> spam.
>
> Doug

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Doug Cutting <cu...@nutch.org>.

yoursoft@freemail.hu wrote:
> In my config both parameters are true, and I have the problem with www.nb1.hu/*.

Sorry if I wasn't clear.  Once you set these parameters, you should no 
longer use the 'analyze' command at all.

Doug

Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.

In my config both parameters are true, and I have the problem with www.nb1.hu/*.
Will you replace the analysis with map-reduce?

Doug Cutting wrote:

> yoursoft@freemail.hu wrote:
>
>> I have the same problem with the latest Nutch from svn. The reported 
>> solution doesn't work for me. I would like to fix this problem.
>> Does anyone have any idea how to start with it?
>
>
> Sorry, but I have not used the analyze tool in a while.  It is not 
> actively maintained.
>
> As a workaround, I recommend setting both 
> fetchlist.score.by.link.count and indexer.boost.by.link.count to 
> true.  This works well, provided you're not encountering a lot of link 
> spam.
>
> Doug

Re: NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by Doug Cutting <cu...@nutch.org>.

yoursoft@freemail.hu wrote:
> I have the same problem with the latest Nutch from svn. The reported 
> solution doesn't work for me. I would like to fix this problem.
> Does anyone have any idea how to start with it?

Sorry, but I have not used the analyze tool in a while.  It is not 
actively maintained.

As a workaround, I recommend setting both fetchlist.score.by.link.count 
and indexer.boost.by.link.count to true.  This works well, provided 
you're not encountering a lot of link spam.

Doug
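
As an illustration of the workaround in this message, the two properties might be set as follows. This is a sketch under assumptions: the property names come from the message, but the file location (a site-specific override such as conf/nutch-site.xml) and the property/name/value layout are assumed here and may differ between Nutch versions.

```xml
<!-- Assumed to sit inside the root element of the site override file -->
<property>
  <name>fetchlist.score.by.link.count</name>
  <value>true</value>
</property>
<property>
  <name>indexer.boost.by.link.count</name>
  <value>true</value>
</property>
```

As a later reply in the thread clarifies, the 'analyze' command should no longer be run once these are set.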

NUTCH-7 - analyze tool takes up all the disk space when there are circular links

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.

Dear Guys,

I have the same problem with the latest Nutch from svn. The reported 
solution doesn't work for me. I would like to fix this problem.
Does anyone have any idea how to start with it?

Best Regards,
    Ferenc