Posted to dev@nutch.apache.org by "Aaron Binns (JIRA)" <ji...@apache.org> on 2009/03/01 21:13:12 UTC

[jira] Created: (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.

NutchBean: OOM due to searcher.max.hits and dedup.
--------------------------------------------------

                 Key: NUTCH-708
                 URL: https://issues.apache.org/jira/browse/NUTCH-708
             Project: Nutch
          Issue Type: Bug
          Components: searcher
    Affects Versions: 1.0.0
         Environment: Ubuntu Linux, Java 5.
            Reporter: Aaron Binns


When searching an index we built for the National Archives (this one in particular: http://webharvest.gov/collections/congress110th/), we ran into an interesting situation.

We were using searcher.max.hits=1000 in order to get faster searches.  Since our index is sorted, the "best" documents are "at the front", so capping the raw hits at 1000 gave us a nice trade-off of search quality vs. response time.
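
For reference, searcher.max.hits is an ordinary Nutch configuration property set in nutch-site.xml.  A minimal sketch of checking the effective value from Java (the NutchConfiguration and Configuration calls are the standard Nutch 1.0 API; the class name here is made up, and -1 is, as far as I know, the stock default meaning "no cap"):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class MaxHitsCheck {
      public static void main(String[] args) {
        // Loads nutch-default.xml and nutch-site.xml, the same way NutchBean does.
        Configuration conf = NutchConfiguration.create();

        // We had this set to 1000 in nutch-site.xml; -1 disables the cap.
        int maxHits = conf.getInt("searcher.max.hits", -1);
        System.out.println("searcher.max.hits = " + maxHits);
      }
    }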

What I discovered was that with dedup (on site) enabled, we would get into a loop: searcher.max.hits would limit the raw hits to 1000, and the deduplication code would reach the end of those 1000 results still needing more, because it hadn't yet found enough de-dup'd results to satisfy the query.

The first 6 pages of results would be fine, but when we got to page 7, the NutchBean would need more than 1000 raw results in order to get 60 de-duped results.
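
To make the arithmetic concrete (assuming 10 hits per page, so 6 pages = 60 de-duped results): with dedup on site, each site only contributes a limited number of results, so many raw hits collapse into few presentable ones.  A toy simulation with completely made-up numbers (50 distinct sites, 1 hit allowed per site; nothing here is measured from the congress110th index) just to show how 1000 raw hits can fall short:

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration only; the site spread and dedup limit are invented.
    public class DedupCollapse {
      public static void main(String[] args) {
        int rawHits = 1000;      // what searcher.max.hits gave us back
        int hitsPerSite = 1;     // dedup-on-site limit (hypothetical)
        int distinctSites = 50;  // raw hits spread over only 50 sites (hypothetical)

        Map<String, Integer> perSite = new HashMap<String, Integer>();
        int deduped = 0;
        for (int i = 0; i < rawHits; i++) {
          String site = "site" + (i % distinctSites);
          Integer seen = perSite.get(site);
          int count = (seen == null) ? 0 : seen.intValue();
          if (count < hitsPerSite) {
            deduped++;
            perSite.put(site, count + 1);
          }
        }
        // Prints raw=1000 deduped=50: not enough to fill later pages,
        // so NutchBean goes back for more raw hits.
        System.out.println("raw=" + rawHits + " deduped=" + deduped);
      }
    }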

The code:
    for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
      // get the next raw hit
      if (rawHitNum >= hits.getLength()) {
        // optimize query by prohibiting more matches on some excluded values
        Query optQuery = (Query)query.clone();
        for (int i = 0; i < excludedValues.size(); i++) {
          if (i == MAX_PROHIBITED_TERMS)
            break;
          optQuery.addProhibitedTerm(((String)excludedValues.get(i)),
                                     dedupField);
        }
        numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
        if (LOG.isInfoEnabled()) {
          LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
        }
        hits = searcher.search(optQuery, numHitsRaw,
                               dedupField, sortField, reverse);
        if (LOG.isInfoEnabled()) {
          LOG.info("found "+hits.getTotal()+" raw hits");
        }
        rawHitNum = -1;
        continue;
      }
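
One way I could imagine breaking the cycle (just a sketch against the branch quoted above, not a tested patch): remember how many raw hits the previous search returned, and give up on growing numHitsRaw once a bigger request stops producing more hits.  Something along these lines:

        // Sketch: before re-searching, remember how many raw hits we already have.
        int previousLength = hits.getLength();

        numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
        hits = searcher.search(optQuery, numHitsRaw,
                               dedupField, sortField, reverse);

        // If asking for more raw hits didn't actually return more (because
        // searcher.max.hits caps them), stop and return what we have instead
        // of doubling numHitsRaw until it overflows.
        if (hits.getLength() <= previousLength) {
          break;
        }
        rawHitNum = -1;
        continue;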

The loop could never finish, because rawHitNum and hits.getLength() are both capped by searcher.max.hits (1000).  numHitsRaw keeps increasing by a factor of 2 (rawHitsFactor) until it reaches 2^31 or so; deep down in the search library code an array is allocated with that value as its size, and you get an OOM.
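
The doubling itself is easy to reproduce in isolation.  A standalone sketch (the 1000 starting value and the factor of 2 mirror what is described above) shows numHitsRaw pinning at Integer.MAX_VALUE after a couple dozen iterations, which is the size the library then tries to allocate:

    public class HitsGrowth {
      public static void main(String[] args) {
        float rawHitsFactor = 2.0f;  // the factor-of-2 growth described above
        int numHitsRaw = 1000;       // our searcher.max.hits value

        for (int round = 1; round <= 25; round++) {
          // Same arithmetic as the excerpt: float multiply, cast back to int.
          // Once the product exceeds Integer.MAX_VALUE, the cast pins it there.
          numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
          System.out.println("round " + round + ": numHitsRaw = " + numHitsRaw);
        }
      }
    }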

We worked around the problem by abandoning the use of searcher.max.hits.  I suppose we could have increased the value, but the index was small enough (~10GB) that disabling searcher.max.hits didn't degrade the response time too much.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.