Posted to user@nutch.apache.org by Jason Tang <ja...@commcentral.com> on 2005/03/07 04:51:32 UTC

Re: [Nutch-general] Crawling specific pages

Hello Aldo,

I am working on your second issue (the restart) now.
Here is why you currently have to restart the app server to reload the index.

--------NutchBean.java------------

  /** Cache in servlet context. */
  public static NutchBean get(ServletContext app) throws IOException {
    NutchBean bean = (NutchBean)app.getAttribute("nutchBean");
    if (bean == null) {
      LOG.info("creating new bean");
      bean = new NutchBean();
      app.setAttribute("nutchBean", bean);
    }
    return bean;
  }
The get() method creates the bean once, caches it in the servlet context, and never checks the index directory for changes afterwards.

My proposed solution is:
  /** Cache in servlet context, but reload the bean whenever the index changes. */
  public static NutchBean get(ServletContext app) throws IOException {
  	
    // If the index has been modified, the file watcher detects it and replaces the cached NutchBean.
    loadIndexAndWatch(app, 1000);
    NutchBean bean = (NutchBean)app.getAttribute("nutchBean");
    if (bean == null) {
      LOG.info("creating new bean");
      bean = new NutchBean();
      app.setAttribute("nutchBean", bean);
    }
    return bean;
  }
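
For illustration only, here is a minimal, servlet-free sketch of the reload-on-change idea. It is not the loadIndexAndWatch implementation and not Nutch code: ReloadingHolder, watchedDir and factory are hypothetical names, and the factory stands in for creating a new NutchBean. It assumes that comparing the index directory's last-modified time is a good enough change signal, which only holds when entries are added, removed or renamed directly inside that directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.function.Supplier;

/** Hypothetical helper: caches a value and rebuilds it when a directory's timestamp changes. */
class ReloadingHolder<T> {
  private final Path watchedDir;      // assumption: the index directory
  private final Supplier<T> factory;  // assumption: builds a fresh value (stand-in for a new bean)
  private T cached;                   // the currently cached value, null until first use
  private FileTime loadedAt;          // directory timestamp seen when 'cached' was built

  ReloadingHolder(Path watchedDir, Supplier<T> factory) {
    this.watchedDir = watchedDir;
    this.factory = factory;
  }

  /** Returns the cached value, rebuilding it first if the directory timestamp changed. */
  synchronized T get() throws IOException {
    FileTime current = Files.getLastModifiedTime(watchedDir);
    if (cached == null || !current.equals(loadedAt)) {
      cached = factory.get();
      loadedAt = current;
    }
    return cached;
  }
}
```

With this shape, every call to get() pays for one timestamp lookup, but no restart is needed to pick up a changed directory.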

I plan to test this solution this week and submit it. Any suggestions?
Alternatively, I could write an IndexLoaderServlet or a listener instead. Which approach would be better?
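
To make the comparison concrete, here is a rough sketch of how the alternative could look if the check ran in the background instead of inside get(). It is only an assumption about the design: BackgroundReloader and its methods are hypothetical names, and a real listener would presumably call start() from contextInitialized() and stop() from contextDestroyed(). Change detection again relies on the directory's last-modified time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical helper: a background task swaps in a fresh value when the directory changes. */
class BackgroundReloader<T> {
  private final Path watchedDir;      // assumption: the index directory
  private final Supplier<T> factory;  // assumption: stand-in for creating a new bean
  private final AtomicReference<T> current = new AtomicReference<>();
  private FileTime loadedAt;          // guarded by 'this'
  private ScheduledExecutorService scheduler;

  BackgroundReloader(Path watchedDir, Supplier<T> factory) throws IOException {
    this.watchedDir = watchedDir;
    this.factory = factory;
    checkOnce(); // initial load
  }

  /** Rebuilds the value if the directory timestamp differs from the one last seen. */
  synchronized void checkOnce() throws IOException {
    FileTime now = Files.getLastModifiedTime(watchedDir);
    if (!now.equals(loadedAt)) {
      current.set(factory.get());
      loadedAt = now;
    }
  }

  /** Starts periodic checks on a daemon thread; a second call is a no-op. */
  synchronized void start(long periodMillis) {
    if (scheduler != null) return;
    scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "index-reloader");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        checkOnce();
      } catch (IOException | RuntimeException e) {
        // Keep serving the old value; swallowing the error keeps the schedule alive.
      }
    }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
  }

  /** Stops the periodic checks. */
  synchronized void stop() {
    if (scheduler != null) {
      scheduler.shutdownNow();
      scheduler = null;
    }
  }

  /** Cheap read path: no file system access per request. */
  T get() {
    return current.get();
  }
}
```

The trade-off is that get() stays cheap, while a changed index is picked up only after the next periodic check.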
  

/Jack
======= At 2005-03-06, 06:18:14 you wrote: =======

>Hi All...
>
>I need to crawl about 20 sites. Each site is structured like this:
>
>ENTRY-PAGE (http://www.example.com/lists.html) with a list of links to
>sub-pages (http://www.example.com/sub-page1.html ... sub-pageN.html).
>
>I need to:
>
>- _always_ fetch the ENTRY-PAGE (the list of links to sub-pages);
>- if a sub-page (URL) is already in the DB, *don't fetch it* (to keep bandwidth low);
>- if a sub-page is not in the DB, *fetch it*.
>
>I need to run this about 2 or 3 times a day for all 20 sites.
>
>Is it possible to use Nutch for this? If so, how should I configure it?
>
>Thank you,
>Duc.
>
>P.S. I noticed that I need to restart Apache Tomcat before new segments become searchable.
>Is there a way to do this without restarting it?
>
>

= = = = = = = = = = = = = = = = = = = =