Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/08/16 13:53:31 UTC

[jira] [Closed] (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

     [ https://issues.apache.org/jira/browse/NUTCH-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-288.
-------------------------------

    Resolution: Won't Fix

Legacy issue. Search is being delegated to Solr.

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-288
>                 URL: https://issues.apache.org/jira/browse/NUTCH-288
>             Project: Nutch
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 0.8
>            Reporter: Stefan Neufeind
>         Attachments: NUTCH-288-OpenSearch-fix.patch
>
>
> The per-site deduplication functionality (hitsPerSite = 3) causes problems when trying to offer page navigation (e.g. allowing the user to jump to page 10), because dedup is done after fetching.
> RSS reports a maximum of 7763 documents (that is without dedup!), and I display 10 items per page. My "naive" approach was to estimate 7763/10 = 777 pages. But already when moving to page 3 I got no more search results (I guess because of dedup), and when moving to page 10 I got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> The only workaround I see for the moment: fetch the RSS without deduplication, dedup myself, and cache the RSS result to improve performance. But a cleaner solution would IMHO be nice. Is there a performant way of doing deduplication while knowing for sure how many documents are available to view? That would presumably mean deduplicating all search results first ...
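The failure mode described above can be sketched in a few lines. This is a hypothetical illustration, not Nutch's actual Hits.getHits code: if the page offset is computed from the pre-dedup total (7763) but the post-dedup result set is much smaller, a page-slice allocation sized as "total minus start" goes negative, which is exactly the kind of arithmetic that produces a NegativeArraySizeException.

```java
// Hypothetical sketch of naive paging over a deduplicated result set.
// All names here are illustrative; this is not the Nutch implementation.
public class PagingSketch {

    // Return hit indices for one page. If 'start' was derived from the
    // pre-dedup total, 'totalAfterDedup - start' can be negative, and
    // allocating an array of that size throws NegativeArraySizeException.
    static int[] getPage(int totalAfterDedup, int start, int pageSize) {
        int remaining = totalAfterDedup - start;              // may be negative!
        int[] page = new int[Math.min(pageSize, remaining)];  // throws if remaining < 0
        for (int i = 0; i < page.length; i++) {
            page[i] = start + i;
        }
        return page;
    }

    public static void main(String[] args) {
        // Suppose only 25 hits survive dedup, but the UI paged as if
        // 7763 existed and the user jumped to page 10 (start = 90).
        try {
            getPage(25, 90, 10);
            System.out.println("no exception");
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException");
        }
    }
}
```

A defensive fix along these lines would clamp the slice size with `Math.max(0, Math.min(pageSize, remaining))` and report "no more results" instead of throwing, though that still leaves the underlying problem that the true post-dedup total is unknown until all results are deduplicated.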

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira