Posted to user@nutch.apache.org by abhishek tiwari <ab...@gmail.com> on 2012/05/23 11:30:04 UTC

Need help

Hi, I am new to Nutch.

I want to use the urlmeta plugin, but I am not able to fetch meta tags.

1) Added the following to nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>

  </description>
</property>


2) Added  <field name="keywords" type="string" stored="true"
indexed="true"/>  to the Solr schema.xml

3) Ran  bin/nutch crawl urls -solr http://localhost:8080/solr -depth 3 -topN 5

The seed URLs and the rest of the setup are also done,

but the keywords field is not getting populated.

Please suggest what I am missing.
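
One way to confirm whether the field is really empty in the index is to ask Solr for documents that have any value in it. The sketch below only builds the query URL. It assumes the Solr base URL from step 3, the standard `/select` request handler, and the `keywords` field from step 2; the helper name `build_field_check_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_field_check_url(solr_base, field, rows=5):
    """Build a Solr select URL matching documents that have any value in `field`.

    Assumption: the default /select request handler is enabled.
    """
    params = {
        "q": f"{field}:[* TO *]",  # open range query: any value present
        "fl": f"url,{field}",      # return only the URL and the field itself
        "rows": rows,
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_field_check_url("http://localhost:8080/solr", "keywords"))
```

If opening the resulting URL returns no documents, the field was not populated at indexing time, so the problem is on the Nutch side rather than in how the results are displayed.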

Re: Need help

Posted by abhishek tiwari <ab...@gmail.com>.

I am not able to see the metadata plugins among the registered plugins:

2012-05-29 18:45:07,701 INFO  plugin.PluginRepository - Plugins: looking in: /var/www/html/nutch/runtime/local/plugins
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository - Registered Plugins:
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository - Registered Extension-Points:
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
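
The listing above can also be checked mechanically. The sketch below assumes log lines in the format shown (including entries that mail wrapping has split across two lines), pulls the plugin ids out of the "Registered Plugins" section, and reports which expected ids are absent. The function names are illustrative, and the expected ids `parse-metatags` and `index-metadata` are taken from the plugin.includes value posted in this thread.

```python
import re


def registered_plugin_ids(log_text):
    """Return the set of plugin ids listed under 'Registered Plugins:'."""
    # Collapse all whitespace so entries wrapped across lines are rejoined.
    flat = " ".join(log_text.split())
    start = flat.find("Registered Plugins:")
    if start == -1:
        return set()
    end = flat.find("Registered Extension-Points:", start)
    section = flat[start:end if end != -1 else len(flat)]
    # Each entry ends with its id in parentheses, e.g. "(parse-html)".
    return set(re.findall(r"\(([A-Za-z0-9_.-]+)\)", section))


def missing_plugins(log_text, expected):
    """List the expected plugin ids that the log does not report as registered."""
    found = registered_plugin_ids(log_text)
    return [plugin_id for plugin_id in expected if plugin_id not in found]


if __name__ == "__main__":
    sample = """\
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository - Registered Plugins:
2012-05-29 18:45:07,759 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository - Registered
Extension-Points:
2012-05-29 18:45:07,760 INFO  plugin.PluginRepository -         Nutch
Protocol (org.apache.nutch.protocol.Protocol)
"""
    print(missing_plugins(sample, ["parse-html", "parse-metatags", "index-metadata"]))
```

Run against the full log above, neither `parse-metatags` nor `index-metadata` would be found, which matches what the listing shows by eye.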


On Tue, May 29, 2012 at 6:37 PM, abhishek tiwari <abhishek.tiwaree@gmail.com
> wrote:

> [...]

Re: Need help

Posted by abhishek tiwari <ab...@gmail.com>.

Thanks for replying. I am still not able to fetch the keywords.
My nutch-site.xml is:

<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>metatags.names</name>
<value>description;keywords</value>
<description> Names of the metatags to extract, separated by ';'.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
</description>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>

</configuration>
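
As a cross-check of the configuration above, the following sketch reads the properties and verifies two things: that plugin.includes matches the ids `parse-metatags` and `index-metadata`, and that every key in index.parse.md is one of the metatags.names entries prefixed with `metatag.`, as the description says. It assumes that the include pattern has to match the whole plugin id, and it uses the separators seen in this file (`;` for metatags.names, `,` for index.parse.md). The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def check_metatag_config(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    includes = props.get("plugin.includes", "")
    for plugin_id in ("parse-metatags", "index-metadata"):
        # Assumption: the pattern is applied to the full plugin id.
        if not re.fullmatch(includes, plugin_id):
            problems.append(f"plugin.includes does not match {plugin_id}")
    names = [n.strip() for n in props.get("metatags.names", "").split(";") if n.strip()]
    expected_keys = {f"metatag.{n}" for n in names}
    for key in props.get("index.parse.md", "").split(","):
        key = key.strip()
        # '*' means all metatags are extracted, so any key is acceptable.
        if key and "*" not in names and key not in expected_keys:
            problems.append(f"index.parse.md key {key} has no metatags.names entry")
    return problems


if __name__ == "__main__":
    sample = """<configuration>
<property><name>plugin.includes</name>
<value>parse-(html|tika|metatags)|index-(basic|anchor|metadata)</value></property>
<property><name>metatags.names</name><value>description;keywords</value></property>
<property><name>index.parse.md</name>
<value>metatag.description,metatag.keywords</value></property>
</configuration>"""
    print(check_metatag_config(read_properties(sample)))
```

If a check like this reports no problems for the file above, the configuration itself is probably consistent, and it may be worth checking whether the two plugin directories actually exist under the plugins folder that Nutch reports in its log.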

and the Solr schema has the following fields:

<fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <!-- core fields -->
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="true"/>
  <field name="site" type="string" stored="false" indexed="true"/>
  <field name="url" type="url" stored="true" indexed="true" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="true"/>
  <field name="cache" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
  <!-- fields for index-anchor plugin -->
  <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="date" stored="true" indexed="false"/>
  <field name="date" type="date" stored="true" indexed="true"/>
  <!-- fields for languageidentifier plugin -->
  <field name="lang" type="string" stored="true" indexed="true"/>
  <!-- fields for subcollection plugin -->
  <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
  <field name="feed" type="string" stored="true" indexed="true"/>
  <field name="publishedDate" type="date" stored="true" indexed="true"/>
  <field name="updatedDate" type="date" stored="true" indexed="true"/>
  <!-- fields for creativecommons plugin -->
  <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for the metatags plugin -->
  <field name="metatag.description" type="text" stored="true" indexed="true"/>
  <field name="metatag.keywords" type="text" stored="true" indexed="true"/>
</fields>


I am not able to find the problem.

I have also created my own plugin, but the field is not populated when we crawl.

Please help me find the reason.

On Wed, May 23, 2012 at 3:47 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> the urlmeta plugin is not what you are after. see instructions on
> http://wiki.apache.org/nutch/IndexMetatags

Re: Need help

Posted by Julien Nioche <li...@gmail.com>.

The urlmeta plugin is not what you are after. See the instructions at
http://wiki.apache.org/nutch/IndexMetatags



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble