Posted to user@nutch.apache.org by "Ian.Priest" <Ia...@opsera.com> on 2007/05/08 15:41:37 UTC

Newbie hello and web-setup question

Hi all,

 

I'm just starting out using Nutch and have a (probably basic) question
about configuration.

 

I want to use Nutch to provide search facilities for a website. I have
installed it at c:/nutch-0.9, edited the files in c:/nutch-0.9/conf, and
created a crawl index at c:/nutch-0.9/crawl.mysite.

 

Now I'm trying to use NutchBean to return some search results to a page
on my site. I'm using Tomcat and have added nutch-0.9.jar to the
deployed war file. However, I'm having trouble getting Nutch to run
against the external-to-tomcat directory.

Specifically, I don't want to have copies of the Nutch config files in
the deployed webapp to avoid management overhead on the live site
associated with keeping two copies of the file in sync, so I want to
convince the webapp to use c:/nutch-0.9/conf to get nutch-site.xml and
so forth. 

 

I can get most of the config loaded by loading the xml files into a
configuration object like this...

 

    //conf = NutchConfiguration.create();
    conf = new Configuration();

    // Add Nutch config files using nutchPath as a base to over-ride defaults
    File defaultFile = new File(nutchPath + "/conf/nutch-default.xml");
    if (defaultFile.exists()) {
        conf.addDefaultResource(defaultFile.toURL());
    }
    File siteFile = new File(nutchPath + "/conf/nutch-site.xml");
    if (siteFile.exists()) {
        conf.addFinalResource(siteFile.toURL());
    }

    bean = new NutchBean(conf);

  

but for some reason it won't pick up a file called common-terms.utf8. It
just reports:

 

2007-05-08 13:50:14,318 [main] INFO [org.apache.hadoop.conf.Configuration] C:/nutch-0.9/conf/common-terms.utf8 not found

 

Although that file is certainly present. And then it throws an NPE
because it can't find the file:

 

java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:61)
        at java.io.BufferedReader.<init>(BufferedReader.java:76)
...
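
For illustration, the top two frames of that trace are what you would see if a null Reader were handed to the BufferedReader constructor. The sketch below reproduces only that JDK behaviour; the class and method names are made up for the example, and it is an assumption (the rest of the trace is cut off) that Nutch ends up here because the earlier lookup returned nothing.

```java
import java.io.BufferedReader;
import java.io.Reader;

public class NullReaderDemo {

    // Returns true if wrapping a null Reader in a BufferedReader throws
    // a NullPointerException from inside the constructor chain.
    static boolean wrappingNullThrowsNpe() {
        Reader missing = null; // stands in for a resource lookup that found nothing
        try {
            new BufferedReader(missing);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE on null reader: " + wrappingNullThrowsNpe());
    }
}
```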

 

Anyone know where I'm going wrong and how I can configure it so I don't
have to include the config files in my war?
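
One possible explanation, which is an assumption and not something confirmed in this thread, is that the file is being looked up by name through a class loader and not opened as a file-system path. If that is the case, a directory outside the war only becomes visible once it is on the class loader's search path. The sketch below shows just that mechanism with the plain JDK: the class name, method names and temporary directory are hypothetical stand-ins, and it does not show how to add a directory to Tomcat's own class path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExternalConfDirDemo {

    // Looks up a resource by name through a class loader whose only search
    // path is confDir. Returns null when nothing by that name is found.
    static URL findInDirectory(Path confDir, String resourceName) {
        try {
            URL[] searchPath = { confDir.toUri().toURL() };
            try (URLClassLoader loader = new URLClassLoader(searchPath, null)) {
                return loader.getResource(resourceName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway directory (a stand-in for c:/nutch-0.9/conf)
    // holding a single file named common-terms.utf8.
    static Path createDemoConfDir() {
        try {
            Path confDir = Files.createTempDirectory("nutch-conf-demo");
            Files.write(confDir.resolve("common-terms.utf8"),
                    "# demo\n".getBytes(StandardCharsets.UTF_8));
            return confDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path confDir = createDemoConfDir();
        System.out.println("present file: " + findInDirectory(confDir, "common-terms.utf8"));
        System.out.println("missing file: " + findInDirectory(confDir, "missing.utf8"));
    }
}
```

A file that exists on disk is still reported as missing by this kind of lookup whenever its directory is not on the search path.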

 

Cheers,

Ian.

 

 

 


RE: Newbie hello and web-setup question

Posted by "Ian.Priest" <Ia...@opsera.com>.
Looks like I'm out on my own here, so for anyone else who wants to set
up a minimal Nutch in a web-app, just to use as a site search tool,
here's what I did as a first step.

 - Stripped the config files down to the minimum required for a web-app:
nutch-site.xml and common-terms.utf8, both in WEB-INF/classes so that
they get picked up _before_ the contents of nutch-0.9.jar

 - In WEB-INF/classes/nutch-site.xml set up the searcher.dir and plugin
directory as hard-coded paths to the Nutch installation directory (this
needs to change - more in subsequent postings)
	<property>
  		<name>searcher.dir</name>
	  	<value>c:/nutch-0.9/crawl.mysite</value>
	...
	<property>
	  	<name>plugin.folders</name>
	  	<value>c:/nutch-0.9/plugins</value>
	...

 - Add required jars to WEB-INF/lib: I needed the nutch-0.9, lucene-core,
lucene-misc and hadoop jar files. (My app already contains a bunch of
other jars that are probably needed, but adding those listed got it
working).
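
The point in the first step about WEB-INF/classes being picked up before the jar contents can be sketched with the plain JDK. The example uses a URLClassLoader with two temporary directories standing in for WEB-INF/classes and the unpacked jar; Tomcat's web-app class loader is a different implementation, so treat this as an analogy for first-match-wins lookup, not as Tomcat's actual code. All names in it are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FirstMatchWinsDemo {

    // Puts a nutch-site.xml in two directories, searches "classes" first,
    // and returns the first line of whichever copy the loader resolves.
    static String resolveSiteConfig() {
        try {
            Path classesDir = Files.createTempDirectory("classes"); // stand-in for WEB-INF/classes
            Path jarDir = Files.createTempDirectory("jar");         // stand-in for the jar contents
            Files.write(classesDir.resolve("nutch-site.xml"),
                    "site copy".getBytes(StandardCharsets.UTF_8));
            Files.write(jarDir.resolve("nutch-site.xml"),
                    "jar copy".getBytes(StandardCharsets.UTF_8));

            // Order matters: the earlier entry is consulted first.
            URL[] searchPath = { classesDir.toUri().toURL(), jarDir.toUri().toURL() };
            try (URLClassLoader loader = new URLClassLoader(searchPath, null)) {
                URL winner = loader.getResource("nutch-site.xml");
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(winner.openStream(), StandardCharsets.UTF_8))) {
                    return reader.readLine();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("resolved: " + resolveSiteConfig());
    }
}
```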

Now I can initialize NutchBean in my viewController (yes, I'm working
under JSF in Spring) and use it like this:

	    conf = NutchConfiguration.create();
	    bean = new NutchBean(conf);

And it picks up my index and runs searches. However, it's still loading
much more than I need: the logs show a huge set of plugins being
picked up from the plugin directory. More about that in another e-mail
to keep things clear.

Cheers,
Ian.


