Posted to user@nutch.apache.org by aicha BEN <ai...@yahoo.com> on 2006/07/05 17:24:42 UTC

problem with fetching PDF or word format

hello,
 
I am a new Nutch user.
I configured my nutch-site.xml to index several file formats from my file system:
<value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(msword|xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
When I test with a ".txt" file, it works,
but when I crawl a ".pdf" or ".doc" file, the fetch fails:
fetch of file:///C:/doc/test.pdf failed with: java.lang.Exception: org.apache.nutch.protocol.file.FileError: File Error: 404
 
Could someone help me?
Aïcha

after mergesegs - updatedb?

Posted by Honda-Search Administrator <ad...@honda-search.com>.

I just merged all of my segments into one, which was fast considering I have 
only around 80k documents.  After I run mergesegs I am a little confused as 
to what to do.

I indexed the new master segment... now what?

I ran updatedb (command bin/nutch updatedb crawl/db 
crawl/segments/segment_name_here) and it told me my database now has 260k 
documents.  Did I just _add_ the new segment to the old database?  I'm a bit 
confused because there are only 80k documents in the segment, so how can 
there be 260k records in the database?

I also merged the indexes (ls -d crawl/segments/* | xargs bin/nutch merge 
crawl/index), but I'm afraid it will also retain the old records like the 
updatedb command seemed to do.  Is this true?

So my questions are as follows:  What's up with it saying I have 260k 
records in my index?  Also, will the merge command (see above) recreate the 
index each time it is run, or is it just adding new indexes to old ones?

Matt

Re: problem with fetching PDF or word format

Posted by aicha BEN <ai...@yahoo.com>.

I am sorry, I work with two doc directories, and the one I was actually using doesn't contain my file.
Thanks a lot.


----- Original Message ----
From: Marko Bauhardt <mb...@media-style.com>
To: nutch-user@lucene.apache.org
Sent: Wednesday, 5 July 2006, 5:35:29
Subject: Re: problem with fetching PDF or word format


On 05.07.2006 at 17:24, aicha BEN wrote:

> hello,

Hi,


> fetch of file:///C:/doc/test.pdf failed with: java.lang.Exception:  
> org.apache.nutch.protocol.file.FileError: File Error: 404

Does the PDF file exist? Error code 404 sounds like 'File Not Found'.

Marko

Re: problem with fetching PDF or word format

Posted by Marko Bauhardt <mb...@media-style.com>.

On 05.07.2006 at 17:24, aicha BEN wrote:

> hello,

Hi,


> fetch of file:///C:/doc/test.pdf failed with: java.lang.Exception:  
> org.apache.nutch.protocol.file.FileError: File Error: 404

Does the PDF file exist? Error code 404 sounds like 'File Not Found'.

Marko
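
A quick way to rule out the 'File Not Found' case before crawling is to check that the file: URL really points at an existing file. The following is only a minimal sketch in Python, not part of Nutch; the helper names file_url_to_path and file_url_exists are made up for illustration, and the URL is the one from the error message above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url: str) -> Path:
    """Convert a file:// URL to a local filesystem path (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URL: {url}")
    # url2pathname decodes percent-escapes and, on Windows, turns
    # "/C:/doc/test.pdf" into a drive-letter path.
    return Path(url2pathname(parsed.path))


def file_url_exists(url: str) -> bool:
    """Return True only if the URL resolves to an existing regular file."""
    return file_url_to_path(url).is_file()


if __name__ == "__main__":
    # URL taken from the fetch error in this thread.
    url = "file:///C:/doc/test.pdf"
    print(url, "->", file_url_to_path(url), "exists:", file_url_exists(url))
```

If this reports that the file does not exist, the seed URL most likely points at the wrong directory, which is what turned out to be the case in this thread.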