Posted to dev@nutch.apache.org by "Anton (JIRA)" <ji...@apache.org> on 2014/02/04 10:24:11 UTC

[jira] [Comment Edited] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

    [ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890514#comment-13890514 ] 

Anton  edited comment on NUTCH-1478 at 2/4/14 9:22 AM:
-------------------------------------------------------

I tried NUTCH-1478v4.patch.

When I configure index.content.md = description or metatag.description in nutch-default.xml:
{code:xml}
    <property>
        <name>index.parse.md</name>
        <value>metatag.description</value>
        <description>
            Comma-separated list of keys to be taken from the parse metadata to generate fields.
            Can be used e.g. for 'description' or 'keywords' provided that these values are generated
            by a parser (see parse-metatags plugin)
        </description>
    </property>

    <property>
        <name>index.content.md</name>
        <value>description</value>
        <description>
            Comma-separated list of keys to be taken from the content metadata to generate fields.
        </description>
    </property>

    <property>
        <name>index.db.md</name>
        <value></value>
        <description>
            Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
            Can be used to index values propagated from the seeds with the plugin urlmeta
        </description>
    </property>

    <!-- parse-metatags plugin properties -->
    <property>
        <name>metatags.names</name>
        <value>description</value>
        <description> Names of the metatags to extract, separated by ';'.
            Use '*' to extract all metatags. Prefixes the names with 'metatag.'
            in the parse-metadata. For instance to index description and keywords,
            you need to activate the plugin index-metadata and set the value of the
            parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
        </description>
    </property>
{code}

I got an NPE:
{code:java}
14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
java.lang.Exception: java.lang.NullPointerException
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
	at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
	at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:103)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}
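
The trace only shows that something dereferenced at MetadataIndexer.java:95 was null. One plausible cause (an assumption, not confirmed by the trace) is that a key listed in the configuration has no value in the page's metadata. The sketch below is not the plugin's code. It uses a plain Map with ByteBuffer values as a stand-in for the page metadata, and the class and method names are hypothetical. It only illustrates the kind of guard that skips absent keys instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of Nutch or of NUTCH-1478v4.patch. */
public class NullSafeMetadataLookup {

    /**
     * Decodes the value of every configured key that is present in the
     * metadata map. Keys that are absent (or mapped to null) are skipped.
     */
    public static Map<String, String> collectFields(Map<String, ByteBuffer> metadata,
                                                    String[] configuredKeys) {
        Map<String, String> fields = new HashMap<>();
        if (metadata == null || configuredKeys == null) {
            return fields; // nothing to index
        }
        for (String key : configuredKeys) {
            ByteBuffer raw = metadata.get(key);
            if (raw == null) {
                // Without this guard, decoding a missing value would throw
                // a NullPointerException.
                continue;
            }
            // duplicate() leaves the position of the stored buffer untouched.
            String value = StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample data: only one of the two configured keys exists.
        Map<String, ByteBuffer> metadata = new HashMap<>();
        metadata.put("metatag.description",
                ByteBuffer.wrap("A test page".getBytes(StandardCharsets.UTF_8)));

        String[] configuredKeys = {"metatag.description", "description"};
        System.out.println(collectFields(metadata, configuredKeys));
    }
}
```

If the real cause is a missing key, a guard of this shape in the filter would let indexing continue for pages that lack the configured metatag.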


was (Author: popalka):
I tried NUTCH-1478v4.patch.

When I configure index.content.md in nutch-default.xml:
{code:xml}
    <property>
        <name>index.content.md</name>
        <value>description</value>
        <description>
            Comma-separated list of keys to be taken from the content metadata to generate fields.
        </description>
    </property>
{code}

I got an NPE:
{code:java}
14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
java.lang.Exception: java.lang.NullPointerException
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
	at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
	at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:103)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>
>         Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. They take multiple values of the same tag and index them in Solr, as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to put the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include the JUnit test. I will post an updated version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed in 'index.parse.md' in nutch-site.xml are also defined in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)