Posted to dev@nutch.apache.org by "Jonathan Cooper-Ellis (JIRA)" <ji...@apache.org> on 2014/07/10 19:42:04 UTC

[jira] [Created] (NUTCH-1815) Metadata Parsed with parse-tika is Duplicated

Jonathan Cooper-Ellis created NUTCH-1815:
--------------------------------------------

             Summary: Metadata Parsed with parse-tika is Duplicated
                 Key: NUTCH-1815
                 URL: https://issues.apache.org/jira/browse/NUTCH-1815
             Project: Nutch
          Issue Type: Bug
          Components: indexer, parser
    Affects Versions: 1.8
            Reporter: Jonathan Cooper-Ellis
            Priority: Minor


When Nutch is configured to parse metatags and index metadata from HTML documents, disabling parse-html (and using parse-tika instead) causes each metadata field to be indexed twice with identical content.

The only change I made was to plugin.includes. The description and keywords metatags are already included in nutch-site.xml by default, so I left those settings untouched:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>


Sample output:

$ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html

fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
contentType: text/html
content :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
title :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
host :	www.bizjournals.com
tstamp :	Thu Jul 10 17:34:56 UTC 2014
metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
url :	http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-


In this case, metatag.description appears twice. If parse-html is added back to plugin.includes and the same command is rerun, metatag.description appears only once.
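
To check for this symptom without reading the output by eye, the indexchecker output can be scanned for repeated field lines. The following is only a rough sketch and not part of Nutch: the helper name duplicated_fields is hypothetical, and it assumes each indexed field is printed as "name :" followed by whitespace and the value, as in the sample output above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the sample output: "<field> :<whitespace><value>".
# Status lines such as "fetching: ..." have no space before the colon and are skipped.
FIELD_LINE = re.compile(r"^(\S+) :\s*(.*)$")


def duplicated_fields(indexchecker_output):
    """Map each field name to how often an identical (name, value) pair was printed,
    keeping only the pairs that were printed more than once."""
    counts = Counter()
    for line in indexchecker_output.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return {name: n for (name, _value), n in counts.items() if n > 1}


if __name__ == "__main__":
    # Shortened, made-up excerpt in the same shape as the report's sample output.
    sample = (
        "contentType: text/html\n"
        "host :\twww.bizjournals.com\n"
        "metatag.description :\tA new survey by the Commonwealth Fund\n"
        "metatag.description :\tA new survey by the Commonwealth Fund\n"
    )
    print(duplicated_fields(sample))
```

An empty result from this helper would correspond to the parse-html case described above, where each field is printed once.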


