Posted to dev@nutch.apache.org by "Nick (JIRA)" <ji...@apache.org> on 2013/10/19 03:59:42 UTC

[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

    [ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799732#comment-13799732 ] 

Nick commented on NUTCH-1478:
-----------------------------

This plugin works great when the page has the metatags listed in index.content.md, but it fails with a NullPointerException when they are missing. How do I go about making the fields optional?

<property>
	<name>index.content.md</name>
	<value>description,keywords,author</value>
</property>

bin/nutch indexchecker http://localhost/stories/
fetching: http://localhost/stories/
parsing: http://localhost/stories/
contentType: text/html
Exception in thread "main" java.lang.NullPointerException
	at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
	at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:151)


bin/nutch indexchecker http://localhost/stories/cant-be-satisfied
[1] 5726
fetching: http://localhost/stories/cant-be-satisfied
parsing: http://localhost/stories/cant-be-satisfied
contentType: text/html
content :	Can't be Satisfied
author :	Robert Gordon
title : Can't be Satisfied
keywords :	blues, music, muddy water	
host :	localhost
description :	Life and Times of Muddy Waters
tstamp :	2013-10-19T01:34:41.440Z
url :	http://localhost/stories/cant-be-satisfied
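
For anyone hitting the same trace: one common way to make such fields optional is to skip any configured name that the page does not carry, instead of dereferencing a missing value. The sketch below is only an illustration in plain Java, not the plugin's actual code. The class OptionalMetadataFields, the method collectPresentFields, and the use of a simple String map in place of Nutch's page metadata are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
public class OptionalMetadataFields {

    /**
     * Returns only those configured fields that are present in the page metadata.
     * configuredFields mimics a comma-separated value such as the one given
     * for index.content.md; pageMetadata stands in for the parsed metatags.
     */
    public static Map<String, String> collectPresentFields(
            String configuredFields, Map<String, String> pageMetadata) {
        Map<String, String> indexed = new LinkedHashMap<>();
        for (String name : configuredFields.split(",")) {
            String key = name.trim();
            if (key.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            String value = pageMetadata.get(key);
            if (value == null) {
                continue; // metatag missing on this page: treat it as optional
            }
            indexed.put(key, value);
        }
        return indexed;
    }

    public static void main(String[] args) {
        // A page that only declares an author metatag.
        Map<String, String> page = new LinkedHashMap<>();
        page.put("author", "Robert Gordon");

        // description and keywords are absent and are skipped without an exception.
        System.out.println(collectPresentFields("description,keywords,author", page));
    }
}
```

In the real indexing filter the equivalent guard would go wherever the metadata value is read before being added to the document; the exact spot depends on the patch version in use.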

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>
>         Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. As in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467), this takes multiple values of the same tag and indexes them in Solr.
> The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to put the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include a JUnit test. I will upload a new version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed under 'index.parse.md' in nutch-site.xml are also defined in Solr's schema.xml.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.



--
This message was sent by Atlassian JIRA
(v6.1#6144)