Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2012/09/11 22:36:21 UTC

Patch for parse-metatags to parse multivalued tags

Hi,

I posted to the mailing list earlier about the parse-metatags plugin not
being able to parse multivalued tags, and I also created an issue
(https://issues.apache.org/jira/browse/NUTCH-1467).

I have made a patch file (attached below) that solves the problem. I do
not think it is the best way to do it, but it is a temporary solution that
works for me for now, so I am posting it here.

For example, if there are two tags with the same name:

<meta name="DC.creator" content="R.L. Ticknor">
<meta name="DC.creator" content="J.E. Long">

the parser (with the patch applied) will save the values separated by
commas:

metatag.dc.creator=R.L. Ticknor,J.E. Long

Previously, only the second value was saved, because the java.util.Properties
class is used to store the names and values, and a later value replaces an
earlier one stored under the same name. The patch is for the file
HTMLMetaProcessor.java in
$NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html.

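To illustrate the behaviour described above, here is a minimal standalone
sketch. It is not the attached patch and not the actual HTMLMetaProcessor
code; the class MultiValuedMetaTags and the helpers addMetaTag and
splitValues are hypothetical names made up for this example. It only shows
how java.util.Properties drops the first value, and one way a comma-joined
value could be built and later split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical illustration only; not the code from the attached patch.
class MultiValuedMetaTags {

    // Hypothetical helper: append to an existing value instead of replacing it.
    static void addMetaTag(Properties tags, String name, String content) {
        String key = name.toLowerCase(Locale.ROOT);
        String existing = tags.getProperty(key);
        if (existing == null) {
            tags.setProperty(key, content);
        } else {
            tags.setProperty(key, existing + "," + content);
        }
    }

    // Hypothetical helper for consumers: split a comma-joined value.
    static List<String> splitValues(String joined) {
        return Arrays.asList(joined.split(","));
    }

    public static void main(String[] args) {
        // Plain setProperty: the second call replaces the first value.
        Properties overwritten = new Properties();
        overwritten.setProperty("dc.creator", "R.L. Ticknor");
        overwritten.setProperty("dc.creator", "J.E. Long");
        System.out.println(overwritten.getProperty("dc.creator"));
        // prints: J.E. Long

        // Appending with a comma keeps both values.
        Properties merged = new Properties();
        addMetaTag(merged, "DC.creator", "R.L. Ticknor");
        addMetaTag(merged, "DC.creator", "J.E. Long");
        System.out.println(merged.getProperty("dc.creator"));
        // prints: R.L. Ticknor,J.E. Long

        System.out.println(splitValues(merged.getProperty("dc.creator")));
        // prints: [R.L. Ticknor, J.E. Long]
    }
}
```

One limitation of joining with commas: a single value that itself contains
a comma (for example "Long, J.E.") would come back as two values after the
split, so this approach only round-trips cleanly for values without commas.
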
It would have been better to save the values as an array instead of a
comma-separated string, but since Properties is used to store the names
and values, I thought it best to keep them separated by commas. Anyone who
uses the crawled meta values should call split(',') on the multivalued
fields.

Please let me know if you have any suggestions.

Regards,
--
Kiran Chitturi