Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/09/01 14:52:21 UTC

[jira] [Commented] (NUTCH-1815) Metadata Parsed with parse-tika is Duplicated

    [ https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117394#comment-14117394 ] 

Julien Nioche commented on NUTCH-1815:
--------------------------------------

This problem comes from the fact that the TikaParser calls 

bq. HTMLMetaProcessor.getMetaTags(metaTags, root, base);

to store the metatags extracted from the DOM, but also adds to the parse metadata all the metadata returned by Tika. Descriptions, for example, are obtained from both sources and stored in both places.

The metatag parser ([https://github.com/apache/nutch/blob/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104]) then adds both copies, which is why metatag.description appears twice.

A quick fix would be to modify the Solr schema to allow multiple values, but ideally we'd want to fix the logic above.

To solve this, one option would be to modify the MetaTagsParser so that it does not add a value to the prefixed metadata [https://github.com/apache/nutch/blob/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L91] if something has already been added there with the same key.
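
A rough sketch of that option, to make the idea concrete. This is not the actual MetaTagsParser code: a plain Map stands in for the Nutch parse metadata, and the class name MetaTagDedupSketch and the method addIfAbsent are hypothetical. Only the "metatag." prefix is taken from the indexchecker output in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the logic in MetaTagsParser; not Nutch code.
public class MetaTagDedupSketch {

    // Prefix as seen in the indexchecker output (metatag.description).
    static final String PREFIX = "metatag.";

    // Adds the value under the prefixed key only if nothing is stored there yet.
    // Returns true if the value was added, false if the key was already present.
    static boolean addIfAbsent(Map<String, List<String>> parseMeta,
                               String name, String value) {
        String key = PREFIX + name;
        List<String> existing = parseMeta.get(key);
        if (existing != null && !existing.isEmpty()) {
            return false; // something with the same key was already added
        }
        List<String> values = new ArrayList<>();
        values.add(value);
        parseMeta.put(key, values);
        return true;
    }

    public static void main(String[] args) {
        // The Map plays the role of the parse metadata (an assumption for this sketch).
        Map<String, List<String>> parseMeta = new LinkedHashMap<>();

        // First copy: the description taken from the DOM metatags.
        addIfAbsent(parseMeta, "description", "A new survey by the Commonwealth Fund");
        // Second copy: the same description as returned by Tika; it is skipped.
        addIfAbsent(parseMeta, "description", "A new survey by the Commonwealth Fund");

        System.out.println(parseMeta);
    }
}
```

With a check like this the first value wins, so whichever source is processed first determines what gets indexed. Whether that is acceptable when the two sources disagree would need to be decided in the real fix.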
 

> Metadata Parsed with parse-tika is Duplicated
> ---------------------------------------------
>
>                 Key: NUTCH-1815
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1815
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, parser
>    Affects Versions: 1.8
>            Reporter: Jonathan Cooper-Ellis
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.10
>
>
> When Nutch is configured to parse metatags and index metadata from HTML documents, disabling parse-html (and using parse-tika instead) causes each metadata field to be indexed twice with identical content.
> I only modified plugin.includes (description and keywords metatags are included in nutch-site.xml by default, so I did not modify those):
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>...</description>
> </property>
> Sample output:
> $ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> contentType: text/html
> content :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> title :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> host :	www.bizjournals.com
> tstamp :	Thu Jul 10 17:34:56 UTC 2014
> metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> url :	http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
> In this case, metatag.description appears twice. If parse-html is added back to plugin.includes and the same command is run, metatag.description will only appear once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)