You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/26 01:24:30 UTC
[jira] Updated: (NUTCH-288) hitsPerSite-functionality "flawed":
problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=all ]
Stefan Neufeind updated NUTCH-288:
----------------------------------
Attachment: NUTCH-288-OpenSearch-fix.patch
This patch includes Doug's one-line fix to prevent an exception.
Also it does go back page by page until you get to the last result-page. The start-value returned in the RSS-feed is correct afterwards(!). This easily allows you to check whether the requested result-start and the one received are identical - otherwise you are on the last page and were "redirected" - and now know that you don't need to display any pages in your page-navigation following this one :-)
Applies and works fine for me.
> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
> Key: NUTCH-288
> URL: http://issues.apache.org/jira/browse/NUTCH-288
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
> at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
> at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
> at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
> at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
> at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
> at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
> at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
> at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
> at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
> at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
> at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira