Posted to user@nutch.apache.org by Sergio Morales <se...@yahoo.co.uk> on 2007/10/19 09:28:42 UTC

Fw: Indexer does not update the field "TITLE" of Lucene when processing specific html documents

Hi,
 
I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.
 
It seems the indexer fails to populate the "TITLE" field of the Lucene index when processing certain HTML documents.
 
 
Please find below a brief summary of this issue:
 
1.- Extract the new version into a separate directory and copy across the following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt
 
2.- To reproduce the issue, copy the attached HTML document to your webserver/filesystem.
 
3.- Run the crawl using the following command.
./nutch crawl urls -dir crawl -depth 22
 
4.- Open the index using Luke. 
 
5.- Select the "document" tab and step through the documents until you find the one copied in step 2.
You will see that the TITLE field is empty --> INCORRECT, because this HTML document contains a title.
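
Before blaming the indexer, it may help to confirm independently that the source file really has a non-empty title. The following is only a minimal sketch using Python's standard html.parser module; the names TitleParser and extract_title are made up for illustration and have nothing to do with Nutch's own HTML parsing.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is matched as well.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Return the stripped title text, or an empty string if none is found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

For example, reading the attached document into a string and passing it to extract_title should return its title; an empty string would suggest the problem is in the file rather than in the indexer.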
 
6.- Now open the HTML document, add a space anywhere, then save it again.
 
7.- Repeat steps 3 and 4.

You will notice that this time the "TITLE" field contains the correct information.
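
Step 6 can also be scripted so that the change is repeatable. This is only a sketch: the helper name touch_with_space is hypothetical, it appends the space at the end of the file rather than "anywhere", and the SHA-256 digests are there just to show that the bytes changed. Nothing here is meant to imply what Nutch itself compares between crawls.

```python
import hashlib
from pathlib import Path


def touch_with_space(path):
    """Append one space to the file and return (digest_before, digest_after)."""
    p = Path(path)
    original = p.read_bytes()
    before = hashlib.sha256(original).hexdigest()

    # Write the original bytes plus one trailing space back to the same file.
    updated = original + b" "
    p.write_bytes(updated)
    after = hashlib.sha256(updated).hexdigest()
    return before, after
```

Calling touch_with_space on the copy of the attached document and then re-running step 3 should reproduce the second half of the scenario above.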
 
This problem does NOT occur using NUTCH 9.0.
 
Please advise,
 
Many thanks in advance for your support
 
Serg

