Posted to dev@nutch.apache.org by "Andy Liu (JIRA)" <ji...@apache.org> on 2005/04/12 23:59:17 UTC
[jira] Updated: (NUTCH-5) Hit limiter off-by-one bug
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ]
Andy Liu updated NUTCH-5:
-------------------------
Attachment: fix-hitlimiting.patch
Patches NutchBean to fix hit limiting off-by-one issue.
> Hit limiter off-by-one bug
> --------------------------
>
> Key: NUTCH-5
> URL: http://issues.apache.org/jira/browse/NUTCH-5
> Project: Nutch
> Type: Bug
> Components: searcher
> Reporter: Andy Liu
> Priority: Minor
> Attachments: fix-hitlimiting.patch
>
> When re-searching for more raw hits, the first result of the next site is skipped.
> From NutchBean.java
> *snip*
> // get the next raw hit
> if (rawHitNum >= hits.getLength()) {
>   // optimize query by prohibiting more matches on some excluded sites
>   Query optQuery = (Query) query.clone();
>   for (int i = 0; i < excludedSites.size(); i++) {
>     if (i == MAX_PROHIBITED_TERMS) {
>       break;
>     }
>     optQuery.addProhibitedTerm(((String) excludedSites.get(i)),
>                                IndexSearcher.HIT_LIMIT_FIELD);
>   }
>   numHitsRaw = (int) (numHitsRaw * RAW_HITS_FACTOR);
>   LOG.info("re-searching for " + numHitsRaw +
>            " raw hits, query: " + optQuery);
>   hits = searcher.search(optQuery, numHitsRaw);
>   LOG.info("found " + hits.getTotal() + " raw hits");
>   rawHitNum = 0;
>   continue;
> }
> *snip*
> rawHitNum is reset to 0, but the continue statement then runs the for loop's increment, so the loop resumes at index 1 and skips the first result of the new hit list.
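A minimal Java sketch of the skip described above, not the NutchBean code itself. The class HitLimiterSketch, the method collect, and the string arrays standing in for hit lists are all hypothetical. It illustrates why resetting the index to 0 before continue resumes at index 1, and that resetting to -1 is one possible way to resume at index 0; whether the attached patch takes that approach is not stated in this message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the hit-limiting loop; arrays replace real hit lists.
final class HitLimiterSketch {

    // Walks `first`, then switches to `second` (the simulated re-search) and
    // resets the loop index to `resetTo` before `continue`.
    static List<String> collect(String[] first, String[] second, int resetTo) {
        List<String> seen = new ArrayList<>();
        String[] hits = first;
        boolean reSearched = false;
        for (int rawHitNum = 0; ; rawHitNum++) {
            if (rawHitNum >= hits.length) {
                if (reSearched) {
                    break;          // both batches consumed
                }
                hits = second;      // stands in for searcher.search(optQuery, numHitsRaw)
                reSearched = true;
                rawHitNum = resetTo;
                continue;           // the for-loop increment still runs after this
            }
            seen.add(hits[rawHitNum]);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] first = {"a1", "a2"};
        String[] second = {"b1", "b2", "b3"};
        System.out.println(collect(first, second, 0));   // reset to 0: "b1" is skipped
        System.out.println(collect(first, second, -1));  // reset to -1: "b1" is kept
    }
}
```

An alternative with the same effect would be to restructure the loop so the re-search does not pass through the increment at all; the index reset is shown only because it is the smallest change to the pattern quoted above.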
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Massimo Miccoli <mm...@iltrovatore.it>.
Thanks Doug,
I will try. But is there no other way to detect a crawler loop? How can we
discover similar cases among billions of pages?
Thanks
Massimo
Doug Cutting wrote:
> Massimo Miccoli wrote:
>
>> The general problem is urls like:
>> http://www.agriturismo.pg.it/storia-citta-umbria/index.html
>> a custom not-found page that generates an infinite crawler loop on the site.
>
>
> You're referring to error pages that do not return 404?
>
> In another thread I just suggested a way to handle these:
>
> http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html
>
>
> The url you mention is amenable to this solution. Its title contains
> the string "pagina di errore", but it does not return a 404.
>
> Doug
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: New Crystal Reports XI.
> Version 11 adds new functionality designed to reduce time involved in
> creating, integrating, and deploying reporting solutions. Free runtime
> info,
> new features, or free trial, at: http://www.businessobjects.com/devxi/728
> _______________________________________________
> Nutch-developers mailing list
> Nutch-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
>
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Doug Cutting <cu...@nutch.org>.
Massimo Miccoli wrote:
> The general problem is urls like:
> http://www.agriturismo.pg.it/storia-citta-umbria/index.html
> a custom not-found page that generates an infinite crawler loop on the site.
You're referring to error pages that do not return 404?
In another thread I just suggested a way to handle these:
http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html
The url you mention is amenable to this solution. Its title contains
the string "pagina di errore", but it does not return a 404.
Doug
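The linked message is not reproduced in this thread, so the following Java sketch is only a guess at the general idea of treating a page as an error page based on its title even when the server does not return 404. The class SoftErrorPageSketch, the method looksLikeErrorPage, and the phrase list are hypothetical and are not part of Nutch; only the phrase "pagina di errore" comes from the message above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Nutch API: flags "soft" error pages by their title.
final class SoftErrorPageSketch {

    // Assumed phrase list; only "pagina di errore" is taken from the thread.
    static final List<String> ERROR_TITLE_PHRASES =
        Arrays.asList("pagina di errore", "page not found");

    // Returns true for a real 404, or when the title contains a known error phrase.
    static boolean looksLikeErrorPage(int httpStatus, String title) {
        if (httpStatus == 404) {
            return true;
        }
        if (title == null) {
            return false;
        }
        String normalized = title.toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_TITLE_PHRASES) {
            if (normalized.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A 200 response whose title marks it as an error page.
        System.out.println(looksLikeErrorPage(200, "Pagina di errore"));
        // An ordinary page with an unrelated title.
        System.out.println(looksLikeErrorPage(200, "Storia delle citta umbre"));
    }
}
```

A crawler using a check like this could decline to follow outlinks from flagged pages, which would stop the loop at its source; how Nutch would wire that in is left open here.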
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Massimo Miccoli <mm...@iltrovatore.it>.
The general problem is urls like:
http://www.agriturismo.pg.it/storia-citta-umbria/index.html
a custom not-found page that generates an infinite crawler loop on the site.
It's not a rare case when one (like me) tries to fetch the
whole web.
BTW, if you want to, you can test a Nutch search on 50,000,000 pages (not
urls) at http://crawlers.iltrovatore.it:8088/search.jsp
massimo
Doug Cutting wrote:
> Massimo Miccoli wrote:
>
>> In any case, circular links are a big problem for Nutch, not only
>> for the analyze tool but also for fetcher speed and WebDB size. Any
>> solution?
>
>
> What is the general problem with Nutch's handling of circular links?
> Nearly every site has them. I am able to crawl, index and search
> sites with circular links.
>
> Doug
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Doug Cutting <cu...@nutch.org>.
Massimo Miccoli wrote:
> In any case, circular links are a big problem for Nutch, not only for
> the analyze tool but also for fetcher speed and WebDB size. Any solution?
What is the general problem with Nutch's handling of circular links?
Nearly every site has them. I am able to crawl, index and search sites
with circular links.
Doug
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Massimo Miccoli <mm...@iltrovatore.it>.
In any case, circular links are a big problem for Nutch, not only for
the analyze tool but also for fetcher speed and WebDB size. Any solution?
Thx Massimo
Doug Cutting wrote:
> yoursoft@freemail.hu wrote:
>
>> I have the same problem with the latest Nutch from svn. The reported
>> solution doesn't work for me. I would like to fix this problem.
>> Does anyone have any idea how to start on it?
>
>
> Sorry, but I have not used the analyze tool in a while. It is not
> actively maintained.
>
> As a workaround, I recommend setting both
> fetchlist.score.by.link.count and indexer.boost.by.link.count to
> true. This works well, provided you're not encountering a lot of link
> spam.
>
> Doug
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by Doug Cutting <cu...@nutch.org>.
yoursoft@freemail.hu wrote:
> In my config both parameters are true, and I have a problem with www.nb1.hu/*.
Sorry if I wasn't clear. Once you set these parameters, you should no
longer use the 'analyze' command at all.
Doug
Re: [Nutch-dev] Re: NUTCH-7 - analyze tool takes up all the disk
space when there are circular links
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
In my config both parameters are true, and I have a problem with www.nb1.hu/*.
Will you replace the analysis with map-reduce?
Doug Cutting wrote:
> yoursoft@freemail.hu wrote:
>
>> I have the same problem with the latest Nutch from svn. The reported
>> solution doesn't work for me. I would like to fix this problem.
>> Does anyone have any idea how to start on it?
>
>
> Sorry, but I have not used the analyze tool in a while. It is not
> actively maintained.
>
> As a workaround, I recommend setting both
> fetchlist.score.by.link.count and indexer.boost.by.link.count to
> true. This works well, provided you're not encountering a lot of link
> spam.
>
> Doug
Re: NUTCH-7 - analyze tool takes up all the disk space when there
are circular links
Posted by Doug Cutting <cu...@nutch.org>.
yoursoft@freemail.hu wrote:
> I have the same problem with the latest Nutch from svn. The reported
> solution doesn't work for me. I would like to fix this problem.
> Does anyone have any idea how to start on it?
Sorry, but I have not used the analyze tool in a while. It is not
actively maintained.
As a workaround, I recommend setting both fetchlist.score.by.link.count
and indexer.boost.by.link.count to true. This works well, provided
you're not encountering a lot of link spam.
Doug
NUTCH-7 - analyze tool takes up all the disk space when there are
circular links
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Guys,
I have the same problem with the latest Nutch from svn. The reported
solution doesn't work for me. I would like to fix this problem.
Does anyone have any idea how to start on it?
Best Regards,
Ferenc