Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2012/11/05 15:14:14 UTC
[jira] [Updated] (NUTCH-747) inject&Index metadatas and inherit
these metadatas to all matching suburls
[ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-747:
---------------------------------------
Description:
Hi.
The following two patches add support for injecting and indexing URL metadata. The first patch:
+ injects metadata for URLs into a metadatadb, using the input format
url.com <TAB> <METAKEY> : <TAB> <METAVALUE> <TAB> <METAVALUE> <METAKEY> : <METAVALUE> ...
...
+ updates the parse_data metadata of a shard and writes the metadata to all fetched URLs that start with a URL from the metadatadb
+ supports inheritance of the metadata by all matching sub-URLs
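To make the input format above concrete, here is a minimal Python sketch of how one line could be parsed. It is not taken from the patch. It assumes that fields are separated by tabs, that a field ending in ":" names a key, and that every following field up to the next key is a value of that key. The function name parse_metadata_line is hypothetical.

```python
def parse_metadata_line(line):
    """Parse one metadata line into (url, {key: [values]}).

    Assumed layout (not confirmed by the patch):
    url<TAB>key:<TAB>value<TAB>value<TAB>key:<TAB>value
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    if not fields:
        raise ValueError("empty line")
    url, rest = fields[0], fields[1:]
    metadata = {}
    key = None
    for field in rest:
        if field.endswith(":"):
            # A field ending in ':' starts a new key.
            key = field[:-1].strip()
            metadata.setdefault(key, [])
        elif key is None:
            raise ValueError(f"value {field!r} appears before any key")
        else:
            # Every other field is a value of the most recent key.
            metadata[key].append(field)
    return url, metadata


example = "http://lucene.apache.org/nutch/apidocs-1.0/\tversion:\t1.0"
url, metadata = parse_metadata_line(example)
```

With the example line, url is the apidocs-1.0 prefix and metadata maps "version" to the single value "1.0".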
The second patch implements an index-metadata plugin.
+ This plugin extracts metadata from the parse_data of a shard and indexes it. Which metadata is indexed can be configured in plugin.properties.
+ For example, to index the language, configure plugin.properties with: lang=STORE,UNTOKENIZED
+ This means that the index plugin extracts the metadata values with the key "lang". If any exist, all values are indexed as stored and untokenized.
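The configuration described above can be illustrated with a short Python sketch. It is not the plugin's code. It assumes one key=FLAG,FLAG entry per line and only mirrors the described behavior: configured keys are looked up in the parse metadata, and every value found is paired with the configured flags. The function names are hypothetical.

```python
def parse_index_properties(text):
    """Read 'key=FLAG,FLAG' lines into {key: [flags]}."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line {line!r}")
        fields[key.strip()] = [f.strip() for f in value.split(",") if f.strip()]
    return fields


def select_indexable(parse_metadata, fields):
    """Return (key, value, flags) for each configured key found in the metadata."""
    selected = []
    for key, flags in fields.items():
        for value in parse_metadata.get(key, []):
            selected.append((key, value, flags))
    return selected


config = parse_index_properties("lang=STORE,UNTOKENIZED\n")
to_index = select_indexable({"lang": ["de", "en"], "author": ["x"]}, config)
```

Here only the two "lang" values are selected, each with the flags STORE and UNTOKENIZED. The "author" entry is ignored because it is not configured.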
Example
Create the start URLs in "/tmp/urls/start/urls.txt":
  http://lucene.apache.org/nutch/apidocs-1.0/index.html
  http://lucene.apache.org/nutch/apidocs-0.9/index.html
Create the metadata URLs in "/tmp/urls/metadata/urls.txt":
  http://lucene.apache.org/nutch/apidocs-1.0/ version: 1.0
  http://lucene.apache.org/nutch/apidocs-0.9/ version: 0.9
Inject URLs:
  bin/nutch inject crawldb /tmp/urls/start/
  bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb /tmp/urls/metadata/
Fetch & Parse & Update:
  bin/nutch generate crawldb segments
  bin/nutch fetch segments/20090806105717/
  bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb segments/20090806105717
  bin/nutch updatedb crawldb/ segments/20090806105717/
Fetch & Parse & Update Again:
  ...
Index:
  bin/nutch invertlinks linkdb -dir segments/
  bin/nutch index index crawldb/ linkdb/ segments/20090806105717 segments/20090806110127
Check your Index:
  All URLs starting with "http://lucene.apache.org/nutch/apidocs-1.0/" are indexed with "version:1.0".
  All URLs starting with "http://lucene.apache.org/nutch/apidocs-0.9/" are indexed with "version:0.9".
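The expected result of the example can be sketched in Python as a prefix match against the metadatadb. This is an illustration of the described inheritance, not the code of ParseDataUpdater. The in-memory dict stands in for the metadatadb, the function name is hypothetical, and merging the values when several prefixes match is an assumption.

```python
def inherited_metadata(url, metadatadb):
    """Collect the metadata of every metadatadb URL that is a prefix of `url`."""
    merged = {}
    for prefix, metadata in metadatadb.items():
        if url.startswith(prefix):
            for key, values in metadata.items():
                # Assumption: values from several matching prefixes are merged.
                merged.setdefault(key, []).extend(values)
    return merged


# Stand-in for the metadatadb built from /tmp/urls/metadata/urls.txt.
metadatadb = {
    "http://lucene.apache.org/nutch/apidocs-1.0/": {"version": ["1.0"]},
    "http://lucene.apache.org/nutch/apidocs-0.9/": {"version": ["0.9"]},
}

v10 = inherited_metadata(
    "http://lucene.apache.org/nutch/apidocs-1.0/index.html", metadatadb)
v09 = inherited_metadata(
    "http://lucene.apache.org/nutch/apidocs-0.9/index.html", metadatadb)
```

Both start URLs inherit the version of their matching prefix, while a URL outside both prefixes would get no metadata.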
This issue is somewhat related to NUTCH-655.
> inject&Index metadatas and inherit these metadatas to all matching suburls
> --------------------------------------------------------------------------
>
> Key: NUTCH-747
> URL: https://issues.apache.org/jira/browse/NUTCH-747
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, injector
> Reporter: Marko Bauhardt
> Attachments: index-metadata.patch, metadata.patch
>
>