Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/06/08 12:07:00 UTC
[jira] [Updated] (NUTCH-2567) parse-metatags writes all meta tags twice
[ https://issues.apache.org/jira/browse/NUTCH-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2567:
-----------------------------------
Fix Version/s: 1.17 (was: 1.18)
> parse-metatags writes all meta tags twice
> -----------------------------------------
>
> Key: NUTCH-2567
> URL: https://issues.apache.org/jira/browse/NUTCH-2567
> Project: Nutch
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.17
>
>
> Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:
> {code:java}
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|parse-(tika|metatags)</value>
> </property>
> {code}
> The problem seems to come from [MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111] :
> Both the meta tags from the existing ParseResult and from the HTMLMetaTags are added to the metadata with a "metatag." prefix. But the ParseResult object already contains the HTML meta tags, because they have been added by TikaParser here: [TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]
>
> This bug is a concern because it makes the segments needlessly large, especially if we want to store all meta tags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).
> I would also suggest making the output of [Metadata::toString|https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/metadata/Metadata.java#L235-L245] more readable (for instance, by adding a newline before each new metadata value). That would have made this bug much easier to spot in the output of the parsechecker.
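
The duplication described above can be reproduced in isolation. The following is a minimal sketch, not Nutch code: the class MetaTagDuplicationSketch and its methods collect and render are hypothetical names, and a plain Map<String, List<String>> stands in for a multi-valued metadata store. It assumes only what the report states, namely that two sources carrying the same meta tags are both copied under a "metatag." prefix. The skipDuplicates flag shows one possible guard, which is not necessarily the fix applied in Nutch, and render illustrates the one-value-per-line output suggested in the report. It needs Java 9 or later for Map.of and List.of.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative stand-in for a multi-valued metadata store.
 * This is NOT Nutch's Metadata class; all names here are hypothetical.
 */
class MetaTagDuplicationSketch {

  /**
   * Copies every tag from every source into one multi-valued map,
   * prefixing each name with "metatag.". When skipDuplicates is true,
   * a value already stored under the same name is not added again.
   */
  static Map<String, List<String>> collect(
      List<Map<String, String>> sources, boolean skipDuplicates) {
    Map<String, List<String>> metadata = new LinkedHashMap<>();
    for (Map<String, String> source : sources) {
      for (Map.Entry<String, String> tag : source.entrySet()) {
        List<String> values = metadata.computeIfAbsent(
            "metatag." + tag.getKey(), k -> new ArrayList<>());
        if (!skipDuplicates || !values.contains(tag.getValue())) {
          values.add(tag.getValue());
        }
      }
    }
    return metadata;
  }

  /** Renders one "name=value" pair per line so repeated values stand out. */
  static String render(Map<String, List<String>> metadata) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
      for (String value : entry.getValue()) {
        sb.append(entry.getKey()).append('=').append(value).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Assumption: the first parser already stored the tags in the parse
    // metadata, and the HTML meta tags carry the same values.
    Map<String, String> fromParseResult = Map.of("description", "An example page");
    Map<String, String> fromHtmlMetaTags = Map.of("description", "An example page");
    List<Map<String, String>> sources = List.of(fromParseResult, fromHtmlMetaTags);

    System.out.print("naive:\n" + render(collect(sources, false)));
    System.out.print("guarded:\n" + render(collect(sources, true)));
  }
}
```

With both sources copied unconditionally, metatag.description ends up holding the same value twice. With the guard enabled it is stored once. Printing one pair per line makes the repeated entry visible immediately.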