Posted to dev@nutch.apache.org by "kiran (JIRA)" <ji...@apache.org> on 2012/09/11 22:40:07 UTC

[jira] [Resolved] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran resolved NUTCH-1467.
--------------------------

    Resolution: Implemented

I have made a patch file (attached below) that solves the above problem.
I do not think it is the best way to do it, but it is a temporary solution for me for now, so I am posting it here.

For example, if there are two tags with the same name, like this:

<meta name="DC.creator" content="R.L. Ticknor">
<meta name="DC.creator" content="J.E. Long">

With the patch applied, the parser saves the values separated by commas (metatag.dc.creator=R.L. Ticknor,J.E. Long).

Previously, only the second value was saved, because the Java Properties class is used to store the names and values, and setting a key a second time replaces the earlier value.
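To illustrate the behaviour described above, here is a minimal sketch. It is not the attached patch and not the actual code in HTMLMetaProcessor.java; the class name, the helper addValue, and the key "dc.creator" are assumptions made up for this example. It only shows why a plain Properties object keeps the last value, and one way the values could be appended with a comma instead.

```java
import java.util.Properties;

public class MultiValuedMetaSketch {

    // Hypothetical helper (not from the patch): append a value to an
    // existing key, separated by a comma, instead of replacing it.
    static void addValue(Properties props, String name, String content) {
        String existing = props.getProperty(name);
        if (existing == null) {
            props.setProperty(name, content);
        } else {
            props.setProperty(name, existing + "," + content);
        }
    }

    public static void main(String[] args) {
        // Plain Properties: the second setProperty call replaces the first value.
        Properties overwritten = new Properties();
        overwritten.setProperty("dc.creator", "R.L. Ticknor");
        overwritten.setProperty("dc.creator", "J.E. Long");
        System.out.println(overwritten.getProperty("dc.creator")); // J.E. Long

        // Appending variant: both values are kept, joined by a comma.
        Properties appended = new Properties();
        addValue(appended, "dc.creator", "R.L. Ticknor");
        addValue(appended, "dc.creator", "J.E. Long");
        System.out.println(appended.getProperty("dc.creator")); // R.L. Ticknor,J.E. Long
    }
}
```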

The patch applies to the file HTMLMetaProcessor.java in the path ($NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html).

It would have been better to save the values as an array instead of a comma-separated string, but since Properties is used to store the names and values, I thought it best to keep them separated by commas.

Anyone who uses the crawled meta values should call split(',') to separate the multiple values.
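On the consuming side, splitting the stored string could look like the sketch below. The class and method names are hypothetical and are not part of Nutch. Note that Java's String.split takes a regular expression as a double-quoted string, and that a value which itself contains a comma (for example "Ticknor, R.L.") would be split incorrectly under this scheme.

```java
import java.util.ArrayList;
import java.util.List;

public class MetaValueSplitSketch {

    // Hypothetical helper: turn "a,b" into ["a", "b"], trimming whitespace
    // and skipping empty pieces.
    static List<String> splitValues(String joined) {
        List<String> values = new ArrayList<>();
        if (joined == null) {
            return values;
        }
        for (String piece : joined.split(",")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Value as stored by the patched parser in the example above.
        List<String> creators = splitValues("R.L. Ticknor,J.E. Long");
        for (String creator : creators) {
            System.out.println(creator);
        }
    }
}
```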

Please let me know if you have any suggestions. 

                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an HTML page using http://wiki.apache.org/nutch/IndexMetatags. It does not work well when there are two metatags with the same name but different contents.
> Has anyone encountered this kind of issue?
> Are there any changes that need to be made to the config files to make it work?
> When there are two tags with the same name and different content, it takes the value of the latter tag and saves it rather than creating a multiValue field.
> Many Thanks,
