Posted to user@nutch.apache.org by Björn Wilmsmann <bj...@wilmsmann.de> on 2008/01/29 02:37:29 UTC

common-terms.utf8 not found in class path when using Nutch from WAR file

Hello everybody,

I have run into a rather weird problem that occurs when deploying a  
Grails (http://grails.codehaus.org/) app as a WAR file in Tomcat. My  
app instantiates a NutchDocumentAnalyzer during startup as a Spring  
resource. The Nutch classes and config files are loaded from a JAR  
inside the lib directory of the app.
All of this works fine when running the app via 'grails run-app'.  
However, when running the app under Tomcat via the WAR file generated  
by 'grails war' I get the following stacktrace (excerpt):

Caused by: org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.apache.nutch.analysis.NutchDocumentAnalyzer]: Constructor threw exception; nested exception is java.lang.NullPointerException
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:98)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:87)
    at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:233)
    ... 63 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:61)
    at java.io.BufferedReader.<init>(BufferedReader.java:76)
    at java.io.BufferedReader.<init>(BufferedReader.java:91)
    at org.apache.nutch.analysis.CommonGrams.init(CommonGrams.java:152)
    at org.apache.nutch.analysis.CommonGrams.<init>(CommonGrams.java:52)
    at org.apache.nutch.analysis.NutchDocumentAnalyzer$ContentAnalyzer.<init>(NutchDocumentAnalyzer.java:64)
    at org.apache.nutch.analysis.NutchDocumentAnalyzer.<init>(NutchDocumentAnalyzer.java:55)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:83)
    ... 65 more

This is caused by the common-terms.utf8 file not being found at line 152
of org.apache.nutch.analysis.CommonGrams. However, this file is located
at the root level of the nutch.jar in the lib directory that also
contains the classes themselves. I have also tried copying the file to
TOMCAT/webapps/MY_APP/WEB-INF/classes, TOMCAT/webapps/MY_APP/WEB-INF/
and TOMCAT/webapps/MY_APP/WEB-INF/lib, all to no avail.
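
In case it helps to narrow this down: as far as I can tell, CommonGrams loads
common-terms.utf8 through Hadoop's Configuration as a classpath resource, so
one quick diagnostic is to check whether the webapp's classloader can actually
see the file at runtime. A small standalone sketch (the class name is made up
for illustration):

```java
// Diagnostic sketch: checks whether a named resource is visible to the
// current thread's context classloader, which is the lookup path a webapp
// running under Tomcat would typically use.
public class ResourceCheck {

    static boolean onClasspath(String name) {
        return Thread.currentThread().getContextClassLoader()
                .getResourceAsStream(name) != null;
    }

    public static void main(String[] args) {
        String resource = "common-terms.utf8";
        System.out.println(resource + (onClasspath(resource)
                ? " found on classpath"
                : " NOT found on classpath"));
    }
}
```

If this prints NOT found inside Tomcat but the app works under 'grails
run-app', the problem is packaging/classloader visibility rather than Nutch
itself.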

Does anybody know what could be causing this?

--
Best regards,
Bjoern Wilmsmann




Re: Can IndexReader be opened on a hadoop directory?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kenji wrote:
> I'm trying to open a Lucene index created on a hadoop dfs.
>    Configuration nutchConf = NutchConfiguration.create();
>    FileSystem fs = FileSystem.get(nutchConf);
>    Path lastIndex = this.dataConf.lastIndexDir();
>    IndexReader idxReader = IndexReader.open(fs.getUri().toString() +
>    lastIndex);
> 
> 
> This results in an exception:
> 
>    hdfs://localhost:9000/user/kenji/pages/lastIndex
>    Exception in thread "main" java.io.IOException: The filename,
>    directory name, or volume label syntax is incorrect
>            at java.io.WinNTFileSystem.canonicalize0(Native Method)
>            at java.io.Win32FileSystem.canonicalize(Unknown Source)
>            at java.io.File.getCanonicalPath(Unknown Source)
>            at
>    org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:168)
>            at
>    org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:139)
>            at
>    org.apache.lucene.index.IndexReader.open(IndexReader.java:148)
>            at ix.indexer.PageIndexer.test(Unknown Source)
>            at ix.indexer.PageIndexer.main(Unknown Source)
> 
> I've tried without the uri, then it assumes a local file system.  Is the 
> index reader supposed to work only locally?

If you pass the index path as a String, then IndexReader will silently 
attempt to create an FSDirectory to read from it. This works only on 
local filesystems. In order to use HDFS, you need to create an instance 
of org.apache.nutch.indexer.FsDirectory(Path), and create an IndexReader 
using this directory.
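
For completeness, a minimal sketch of what that looks like, assuming the
FsDirectory constructor takes (FileSystem, Path, create, Configuration);
double-check the signature against your Nutch version, and the index path
here is just an example:

```java
// Sketch only: assumes Nutch's FsDirectory(FileSystem, Path, boolean,
// Configuration) constructor; verify against your Nutch version.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.IndexReader;
import org.apache.nutch.indexer.FsDirectory;
import org.apache.nutch.util.NutchConfiguration;

public class HdfsIndexReaderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        Path lastIndex = new Path("/user/kenji/pages/lastIndex"); // example path

        // Wrap the HDFS path in a Lucene Directory instead of passing a String:
        FsDirectory dir = new FsDirectory(fs, lastIndex, false, conf);
        IndexReader reader = IndexReader.open(dir);
        System.out.println("numDocs = " + reader.numDocs());
        reader.close();
    }
}
```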

Please note that usually the performance of using Lucene indexes 
directly from HDFS is poor.
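
A side note on why the String call fails the way it does: concatenating
fs.getUri().toString() with the Path just produces a plain String, which
IndexReader.open(String) hands to FSDirectory as a local path; on Windows the
leading "hdfs:" is then rejected as an invalid drive/volume label. A tiny
illustration (values are hypothetical stand-ins for the real URI and path):

```java
// Illustrates the String that IndexReader.open(String) actually receives
// when a filesystem URI is concatenated with a path, as in the question.
public class PathConcatDemo {

    static String combine(String fsUri, String indexPath) {
        return fsUri + indexPath; // plain string concatenation, no scheme handling
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for fs.getUri().toString() and lastIndex:
        String s = combine("hdfs://localhost:9000", "/user/kenji/pages/lastIndex");
        // FSDirectory would try to canonicalize this as a *local* file path.
        System.out.println(s);
    }
}
```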



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Can IndexReader be opened on a hadoop directory?

Posted by Kenji <ke...@trailfire.com>.
I'm trying to open a Lucene index created on a Hadoop DFS.

    Configuration nutchConf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(nutchConf);
    Path lastIndex = this.dataConf.lastIndexDir();
    IndexReader idxReader = IndexReader.open(fs.getUri().toString() +
    lastIndex);


This results in an exception:

    hdfs://localhost:9000/user/kenji/pages/lastIndex
    Exception in thread "main" java.io.IOException: The filename,
    directory name, or volume label syntax is incorrect
            at java.io.WinNTFileSystem.canonicalize0(Native Method)
            at java.io.Win32FileSystem.canonicalize(Unknown Source)
            at java.io.File.getCanonicalPath(Unknown Source)
            at
    org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:168)
            at
    org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:139)
            at
    org.apache.lucene.index.IndexReader.open(IndexReader.java:148)
            at ix.indexer.PageIndexer.test(Unknown Source)
            at ix.indexer.PageIndexer.main(Unknown Source)

I've tried without the URI, but then it assumes a local file system. Is the
IndexReader supposed to work only locally?

Thanks.

-Kenji Kawai