Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 16:35:08 UTC

[jira] [Updated] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.

     [ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-708:
--------------------------------


Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> NutchBean: OOM due to searcher.max.hits and dedup.
> --------------------------------------------------
>
>                 Key: NUTCH-708
>                 URL: https://issues.apache.org/jira/browse/NUTCH-708
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Linux, Java 5.
>            Reporter: Aaron Binns
>
> We ran into an interesting situation while searching an index we built for the National Archives, this one in particular: http://webharvest.gov/collections/congress110th/
> We were using searcher.max.hits=1000 in order to get faster searches. Since our index is sorted, the "best" documents are "at the front", so searcher.max.hits=1000 gave us a nice trade-off between search quality and response time.
> What I discovered was that with dedup (on site) enabled, we would get into a loop: searcher.max.hits limited the raw hits to 1000, and the deduplication code would reach the end of those 1000 results still needing more, because it hadn't yet found enough de-dup'd results to satisfy the query.
> The first 6 pages of results would be fine, but when we got to page 7, the NutchBean would need more than 1000 raw results in order to get 60 de-duped results.
> The code:
>     for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
>       // get the next raw hit
>       if (rawHitNum >= hits.getLength()) {
>         // optimize query by prohibiting more matches on some excluded values
>         Query optQuery = (Query)query.clone();
>         for (int i = 0; i < excludedValues.size(); i++) {
>           if (i == MAX_PROHIBITED_TERMS)
>             break;
>           optQuery.addProhibitedTerm(((String)excludedValues.get(i)),
>                                      dedupField);
>         }
>         numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
>         }
>         hits = searcher.search(optQuery, numHitsRaw,
>                                dedupField, sortField, reverse);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("found "+hits.getTotal()+" raw hits");
>         }
>         rawHitNum = -1;
>         continue;
>       }
> The loop could never terminate, as rawHitNum and hits.getLength() are capped by searcher.max.hits (1000). numHitsRaw keeps increasing by a factor of 2 (rawHitsFactor) until it reaches 2^31 or so; deep down in the search library code an array is allocated using that value as its size, and you get an OOM.
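
To make the failure mode concrete, here is a minimal, self-contained simulation (plain Java, not NutchBean code; the class name and the starting request size of 70 are hypothetical, while the 1000-hit cap and the doubling factor come from the report). Each re-search returns at most the capped 1000 raw hits, dedup still comes up short, and the requested size doubles until it passes Integer.MAX_VALUE, which is roughly where the oversized array allocation and the OOM occur.

    // Standalone simulation, not NutchBean code. Assumptions: the searcher never
    // returns more than searcher.max.hits = 1000 raw hits, dedup always needs
    // more results, and the initial request size (70) is illustrative only.
    public class RunawayRehitsSim {
      public static void main(String[] args) {
        final long maxHits = 1000;      // searcher.max.hits cap from the report
        final float rawHitsFactor = 2f; // same doubling factor as in the loop above
        long numHitsRaw = 70;           // hypothetical initial raw-hit request

        while (numHitsRaw <= Integer.MAX_VALUE) {
          long returned = Math.min(numHitsRaw, maxHits);
          System.out.println("requested " + numHitsRaw + " raw hits, got " + returned);
          // dedup exhausts the capped results without filling the page,
          // so the request size doubles and the search is retried
          numHitsRaw = (long) (numHitsRaw * rawHitsFactor);
        }
        System.out.println("next request (" + numHitsRaw + ") exceeds Integer.MAX_VALUE;"
            + " allocating a hit array that large is where the OOM shows up");
      }
    }
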
> We worked around the problem by abandoning the use of searcher.max.hits.  I suppose we could have increased the value, but the index was small enough (~10GB) that disabling searcher.max.hits didn't degrade the response time too much.
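
In configuration terms, that workaround looks roughly like the sketch below. The property name comes from the report; the idea that a non-positive value (the shipped default of -1, as far as I can tell from nutch-default.xml) disables the cap is an assumption, and in a real deployment the override would normally go into nutch-site.xml rather than application code. The class name is made up for the example.

    // Sketch of the workaround: disable the raw-hit cap entirely.
    // Assumption: per nutch-default.xml, a non-positive searcher.max.hits
    // (default -1) means "no cap".
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class DisableMaxHits {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Slower searches on large indexes, but the dedup loop can always
        // obtain more raw hits instead of re-searching forever.
        conf.setInt("searcher.max.hits", -1);
        System.out.println("searcher.max.hits = " + conf.getInt("searcher.max.hits", -1));
      }
    }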

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira