Posted to user@nutch.apache.org by Aïcha <ai...@yahoo.com> on 2006/10/20 17:52:22 UTC

Problem parsing documents: Word, RTF, Excel, etc.

Hi, 

I have a lot of parsing problems when I try to index my directory: only about 50% of the files were indexed.

I already asked the nutch-dev group, but I am trying nutch-user as well; perhaps somebody has had these problems and solved them.

Here is a list of the main errors the parser encountered:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be handled as micrsosoft document. org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance, the following exception occured: null 

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't be handled as micrsosoft document. java.util.NoSuchElementException 
  
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't be handled as micrsosoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256 

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks 
java.net.MalformedURLException: unknown protocol: dsp 
        at java.net.URL.<init>(URL.java:574) 
        at java.net.URL.<init>(URL.java:464) 
        at java.net.URL.<init>(URL.java:413) 
        at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78) 
        at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35) 
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111) 
        at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84) 
        at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43) 
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82) 
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276) 
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152) 


In the last error, the string after "unknown protocol: " is not always dsp; it seems to be different in each case, and I don't understand what this string means.
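
For the test.rtf error, here is a small sketch in case it helps whoever looks at this. It turns the two numbers from the "Invalid header signature" message back into the 8 bytes they stand for. It assumes the parser reads the start of the file as a signed 64-bit little-endian value, which I have not checked in the POI source; the function name signature_bytes is my own.

```python
import struct

# Values copied from the test.rtf error message above.
READ_SIGNATURE = 7015536635646467195
EXPECTED_SIGNATURE = -2226271756974174256


def signature_bytes(value: int) -> bytes:
    """Return the 8 bytes that read as this signed 64-bit value.

    Assumption: the value was read in little-endian byte order.
    """
    return struct.pack("<q", value)


if __name__ == "__main__":
    # First 8 bytes actually found in the file.
    print(signature_bytes(READ_SIGNATURE))
    # First 8 bytes the parser wanted to find, shown as hex.
    print(signature_bytes(EXPECTED_SIGNATURE).hex())
```

Under that assumption, the value that was read decodes to the text {\rtf1\a, which looks like the normal beginning of an RTF file, while the expected value decodes to the bytes d0 cf 11 e0 a1 b1 1a e1. So it seems the .rtf file itself is not damaged, but it is being given to a parser that expects a different file format.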

Thanks in advance,

Best regards, 
Aïcha


	

	
		