Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/10/22 19:46:50 UTC

[jira] Commented: (NUTCH-568) Indexer does not update the Lucene "TITLE" field

    [ https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536756 ] 

Sami Siren commented on NUTCH-568:
----------------------------------

There is a BOM (Byte Order Mark) [feff] at the beginning of the file that seems to confuse Nutch. I did not track down the change that caused this.
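
For illustration, here is a minimal Java sketch of how a leading BOM could be detected and skipped before the bytes reach a parser. This is not Nutch code: the class and method names (BomSketch, bomLength, stripBom) are hypothetical, and it assumes the attached file is UTF-8 or UTF-16 encoded, which has not been verified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch: detects a leading byte order mark.
public class BomSketch {

    /** Returns the byte length of a leading UTF-8 or UTF-16 BOM, or 0 if none is found. */
    static int bomLength(byte[] data) {
        // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return 3;
        }
        // U+FEFF encoded as UTF-16 is FE FF (big-endian) or FF FE (little-endian).
        // Limitation: a UTF-32LE BOM (FF FE 00 00) would also match the second pattern.
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return 2;
            }
        }
        return 0;
    }

    /** Returns a copy of the input without its leading BOM, if it has one. */
    static byte[] stripBom(byte[] data) {
        return Arrays.copyOfRange(data, bomLength(data), data.length);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'h', 't', 'm', 'l', '>'};
        System.out.println("BOM bytes: " + bomLength(withBom));
        System.out.println(new String(stripBom(withBom), StandardCharsets.UTF_8));
    }
}
```

Running a check like this against the attachment would show which encoding of the BOM is actually present; whether stripping it restores the title in the index is a separate question that this sketch does not answer.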

> Indexer does not update the Lucene "TITLE" field
> ------------------------------------------------
>
>                 Key: NUTCH-568
>                 URL: https://issues.apache.org/jira/browse/NUTCH-568
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Windows XP
>            Reporter: smorales
>         Attachments: RN-071018-000024.html
>
>
> Hi,
> The indexer is unable to update the "TITLE" field of the Lucene index when processing certain HTML documents.
> This issue has been reproduced using Nutch nightly build #241 (Oct 19, 2007 4:01:28 AM).
> The problem does not occur using NUTCH 9.0.
> Workflow:
> 1.- Extract the package and copy across the following configuration files from NUTCH 9.0:
> - {nutch_home_9.0}/bin/url folder, containing the urls
> - {nutch_home_9.0}/conf/nutch-site.xml
> - {nutch_home_9.0}/conf/crawl-urlfilter.txt
> 2.- To reproduce the issue, you need to copy the attached HTML document to your webserver/filesystem.
> 3.- Run the crawl.
> For example: ./nutch crawl urls -dir crawl -depth 22
> 4.- Open the index using Luke.  For this test, I used lukeall-0.7.1.jar
> 5.- In the Luke window, select the "document" tab and move through the docs until you find our HTML document.
> You will see that the TITLE field is empty  --> INCORRECT, because this HTML document contains a title.
> 6.- Now, open the HTML document, add a space anywhere, then save it again.
> 7.- Repeat steps 3 and 4.
> You will notice that this time the "TITLE" field contains the correct information.
> Please advise,
> Many thanks in advance for your support.
> Sergio

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.