Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/01 12:15:24 UTC

Re: Indexing meta tags in Nutch 1.4

We use Ant to build the Nutch source.

Once you've added the plugin's configuration to
/src/plugin/build.xml, added the plugin name to the
plugin.includes property in conf/nutch-site.xml, and set any
particular configuration the plugin requires, go to your
Nutch top-level directory and run 'ant runtime'.

You will then be ready to rock and roll.
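
For reference, here is a rough sketch of what the build.xml entries might look like for parse-metatags. The target names and the <ant dir="..."/> form are assumed from the usual layout of the stock file, and the NUTCH-809 patch may already have added these lines, so check your copy before editing:

```xml
<!-- src/plugin/build.xml (excerpt; a sketch under the assumptions above) -->
<target name="deploy">
  <!-- ...existing plugin entries stay as they are... -->
  <ant dir="parse-metatags" target="deploy"/>
</target>

<target name="clean">
  <!-- ...existing plugin entries stay as they are... -->
  <ant dir="parse-metatags" target="clean"/>
</target>
```

With entries like these in place, 'ant runtime' should build the plugin along with the rest and put it under runtime/local/plugins.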

hth

Lewis

On Mon, Apr 30, 2012 at 2:20 PM, ML mail <ml...@yahoo.com> wrote:
> Hi Julien,
>
> Thanks for the hint, I have downloaded the patch and applied it to my Nutch 1.4 installation. Now I see that it created the plugin source files in src/plugin/parse-metatags, and I was wondering how to compile this source into a plugin usable by Nutch. Sorry, I don't have much of a clue about Java...
>
> Regards
>
>
> ________________________________
>  From: Julien Nioche <li...@gmail.com>
> To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> Sent: Monday, April 30, 2012 2:48 PM
> Subject: Re: Indexing meta tags in Nutch 1.4
>
>
> http://wiki.apache.org/nutch/IndexMetatags refers to the code available in the trunk, which is different from the zip you downloaded. If you can't use Nutch trunk, you can use the patch https://issues.apache.org/jira/secure/attachment/12519226/NUTCH-809-trunk.patch, which corresponds to what I committed.
>
> HTH
>
> Julien
>
>
> On 30 April 2012 13:35, ML mail <ml...@yahoo.com> wrote:
>
> Hi,
>>
>>I would like to index the typical description and keywords HTML meta tags using my stable installation of Nutch 1.4. For that, I have followed the instructions from the wiki (http://wiki.apache.org/nutch/IndexMetatags) and downloaded the metatags+plugins_tutorial.zip file from the #NUTCH-809 jira issue. This ZIP file contains the index-metatags plugin but I do believe that one plugin is still missing: parse-metatags. Does anyone know where I can find that plugin? Or am I mistaken and having the index-metatags plugin is enough?
>>
>>Thanks for your feedback.
>>
>>Regards,
>>M.L.
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble



-- 
Lewis

Re: Indexing meta tags in Nutch 1.4

Posted by ML mail <ml...@yahoo.com>.
Now I get it... I won't be using the indexchecker anyway, as I use Solr to do the indexing.


________________________________
 From: Markus Jelsma <ma...@openindex.io>
To: user@nutch.apache.org 
Sent: Thursday, May 3, 2012 10:59 AM
Subject: Re: Indexing meta tags in Nutch 1.4
 
On Thursday 03 May 2012 10:52:25 ML mail wrote:
> Thanks Markus for your tip. I now tried the "parsechecker" and it works
> perfectly, I can see the "Parse Metadata" info which contains the keywords
> and description. I then suppose the documentation on the
> wiki http://wiki.apache.org/nutch/IndexMetatags is wrong as it mentions
> using "indexchecker" instead...

The docs are correct. With parsechecker you can see the output of a parse 
filter. With indexchecker you can only see the output of an index filter. You need 
both a parse filter and an index filter to complete the chain from web page to 
an indexed document.

> 
> 
> 
> 
> ________________________________
>  From: Markus Jelsma <ma...@openindex.io>
> To: ML mail <ml...@yahoo.com>
> Cc: Lewis John Mcgibbney <le...@gmail.com>; user@nutch.apache.org
> Sent: Thursday, May 3, 2012 9:32 AM
> Subject: Re: Indexing meta tags in Nutch 1.4
> 
> You should see it with the parsechecker tool but not with the indexchecker
> because you don't have an indexing filter plugin included that reads and
> emits what's output by the parse filter. Use the index-metadata plugin.
> 
> On Thu, 3 May 2012 00:25:42 -0700 (PDT), ML mail <ml...@yahoo.com> wrote:
> > Dear Lewis,
> > 
> > Thanks for the README about the parse-metatags plugin. I have now
> > double checked and I have the metatags.names property in my
> > nutch-site.xml config file as well as the other required properties.
> > Still when running "nutch indexchecker URL" I don't see any
> > description or keywords fields :( 
> > 
> > Below I have pasted the relevant parts of my nutch-site.xml config file:
> > 
> > <property>
> >         <name>index.parse.md</name>
> >         <value>metatag.description,metatag.keywords</value>
> > </property>
> > 
> > 
> > <property>
> >         <name>metatags.names</name>
> >         <value>description;keywords</value>
> > </property>
> > 
> > 
> > <property>
> >         <name>plugin.includes</name>
> >         <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > 
> > As far as I know this all looks correct but maybe you can see
> > something wrong? or anything else I might check?
> > 
> > Regards
> > 
> > 
> > 
> > ________________________________
> >
> >  From: Lewis John Mcgibbney <le...@gmail.com>
> >
> > To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> > Sent: Wednesday, May 2, 2012 12:49 PM
> > Subject: Re: Indexing meta tags in Nutch 1.4
> > 
> > Hi,
> > 
> > Please also see the README Julien kindly provided with the
> > parse-metatags plugin.
> > 
> > 
> > https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup
> > 
> > I'm hoping there should be enough info to get it working flawlessly.
> > Remember, any changes you make to your config files should really be
> > recompiled before moving on to a more serious deployment.
> > 
> > On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
> >> Hi Lewis,
> >> 
> >> Thanks to your explanations, I managed to get the parse-metatags plugin
> >> built and installed into the runtime/local/plugins directory. So now I
> >> have the index-metatags from the ZIP file as well as the parse-metatags
> >> plugin from the patch installed and wanted to check if they are
> >> working. I followed step-by-step the guide
> >> on http://wiki.apache.org/nutch/IndexMetatags and came to the part
> >> where you check with the "nutch indexchecker URL" command for the
> >> metatag fields. Unfortunately, in the output of that command I don't
> >> see any keywords or description fields :( just the usual ones
> >> (site,title,content,etc).
> >> 
> >> Am I missing something here?
> >> 
> >> Also let me know if you need more details or my nutch-site.xml config
> >> file...
> >> 
> >> Regards
> 
> -- Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex

Re: Indexing meta tags in Nutch 1.4

Posted by Markus Jelsma <ma...@openindex.io>.
On Thursday 03 May 2012 10:52:25 ML mail wrote:
> Thanks Markus for your tip. I now tried the "parsechecker" and it works
> perfectly, I can see the "Parse Metadata" info which contains the keywords
> and description. I then suppose the documentation on the
> wiki http://wiki.apache.org/nutch/IndexMetatags is wrong as it mentions
> using "indexchecker" instead...

The docs are correct. With parsechecker you can see the output of a parse 
filter. With indexchecker you can only see the output of an index filter. You need 
both a parse filter and an index filter to complete the chain from web page to 
an indexed document.
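
To make the chain concrete, here is a sketch of the nutch-site.xml pieces with each one labelled by the stage it belongs to. The values are the ones from the config quoted below. The comments on which plugin reads which property reflect my understanding of parse-metatags and index-metadata, so treat them as assumptions and check the plugin README:

```xml
<!-- conf/nutch-site.xml (sketch; values taken from the config quoted in this thread) -->

<!-- Parse side: assumed to be read by parse-metatags, which stores the
     listed meta tags as parse metadata (visible with parsechecker). -->
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>

<!-- Index side: assumed to be read by index-metadata, which copies the
     listed parse metadata keys into document fields (visible with indexchecker). -->
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>

<!-- Both halves of the chain must be enabled: parse-metatags and index-metadata. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If the fields show up in parsechecker but not in indexchecker, the index side is the part to look at.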

> 
> 
> 
> 
> ________________________________
>  From: Markus Jelsma <ma...@openindex.io>
> To: ML mail <ml...@yahoo.com>
> Cc: Lewis John Mcgibbney <le...@gmail.com>; user@nutch.apache.org
> Sent: Thursday, May 3, 2012 9:32 AM
> Subject: Re: Indexing meta tags in Nutch 1.4
> 
> You should see it with the parsechecker tool but not with the indexchecker
> because you don't have an indexing filter plugin included that reads and
> emits what's output by the parse filter. Use the index-metadata plugin.
> 
> On Thu, 3 May 2012 00:25:42 -0700 (PDT), ML mail <ml...@yahoo.com> wrote:
> > Dear Lewis,
> > 
> > Thanks for the README about the parse-metatags plugin. I have now
> > double checked and I have the metatags.names property in my
> > nutch-site.xml config file as well as the other required properties.
> > Still when running "nutch indexchecker URL" I don't see any
> > description or keywords fields :( 
> > 
> > Below I have pasted the relevant parts of my nutch-site.xml config file:
> > 
> > <property>
> >         <name>index.parse.md</name>
> >         <value>metatag.description,metatag.keywords</value>
> > </property>
> > 
> > 
> > <property>
> >         <name>metatags.names</name>
> >         <value>description;keywords</value>
> > </property>
> > 
> > 
> > <property>
> >         <name>plugin.includes</name>
> >         <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > 
> > As far as I know this all looks correct but maybe you can see
> > something wrong? or anything else I might check?
> > 
> > Regards
> > 
> > 
> > 
> > ________________________________
> >
> >  From: Lewis John Mcgibbney <le...@gmail.com>
> >
> > To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> > Sent: Wednesday, May 2, 2012 12:49 PM
> > Subject: Re: Indexing meta tags in Nutch 1.4
> > 
> > Hi,
> > 
> > Please also see the README Julien kindly provided with the
> > parse-metatags plugin.
> > 
> > 
> > https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup
> > 
> > I'm hoping there should be enough info to get it working flawlessly.
> > Remember, any changes you make to your config files should really be
> > recompiled before moving on to a more serious deployment.
> > 
> > On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
> >> Hi Lewis,
> >> 
> >> Thanks to your explanations, I managed to get the parse-metatags plugin
> >> built and installed into the runtime/local/plugins directory. So now I
> >> have the index-metatags from the ZIP file as well as the parse-metatags
> >> plugin from the patch installed and wanted to check if they are
> >> working. I followed step-by-step the guide
> >> on http://wiki.apache.org/nutch/IndexMetatags and came to the part
> >> where you check with the "nutch indexchecker URL" command for the
> >> metatag fields. Unfortunately, in the output of that command I don't
> >> see any keywords or description fields :( just the usual ones
> >> (site,title,content,etc).
> >> 
> >> Am I missing something here?
> >> 
> >> Also let me know if you need more details or my nutch-site.xml config
> >> file...
> >> 
> >> Regards
> 
> -- Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex

Re: Indexing meta tags in Nutch 1.4

Posted by ML mail <ml...@yahoo.com>.
Thanks Markus for your tip. I have now tried the "parsechecker" and it works perfectly: I can see the "Parse Metadata" info, which contains the keywords and description. I suppose, then, that the documentation on the wiki http://wiki.apache.org/nutch/IndexMetatags is wrong, as it mentions using "indexchecker" instead...




________________________________
 From: Markus Jelsma <ma...@openindex.io>
To: ML mail <ml...@yahoo.com> 
Cc: Lewis John Mcgibbney <le...@gmail.com>; user@nutch.apache.org 
Sent: Thursday, May 3, 2012 9:32 AM
Subject: Re: Indexing meta tags in Nutch 1.4
 
You should see it with the parsechecker tool but not with the indexchecker because you don't have an indexing filter plugin included that reads and emits what's output by the parse filter. Use the index-metadata plugin.

On Thu, 3 May 2012 00:25:42 -0700 (PDT), ML mail <ml...@yahoo.com> wrote:
> Dear Lewis,
> 
> Thanks for the README about the parse-metatags plugin. I have now
> double checked and I have the metatags.names property in my
> nutch-site.xml config file as well as the other required properties.
> Still when running "nutch indexchecker URL" I don't see any
> description or keywords fields :( 
> 
> Below I have pasted the relevant parts of my nutch-site.xml config file:
> 
> <property>
>         <name>index.parse.md</name>
>         <value>metatag.description,metatag.keywords</value>
> </property>
> 
> 
> <property>
>         <name>metatags.names</name>
>         <value>description;keywords</value>
> </property>
> 
> 
> <property>
>         <name>plugin.includes</name>
>         <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> 
> As far as I know this all looks correct but maybe you can see
> something wrong? or anything else I might check?
> 
> Regards
> 
> 
> 
> ________________________________
>  From: Lewis John Mcgibbney <le...@gmail.com>
> To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> Sent: Wednesday, May 2, 2012 12:49 PM
> Subject: Re: Indexing meta tags in Nutch 1.4
> 
> Hi,
> 
> Please also see the README Julien kindly provided with the
> parse-metatags plugin.
> 
> 
> https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup
> 
> I'm hoping there should be enough info to get it working flawlessly.
> Remember, any changes you make to your config files should really be
> recompiled before moving on to a more serious deployment.
> 
> On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
>> Hi Lewis,
>> 
>> Thanks to your explanations, I managed to get the parse-metatags plugin built and installed into the runtime/local/plugins directory. So now I have the index-metatags from the ZIP file as well as the parse-metatags plugin from the patch installed and wanted to check if they are working. I followed step-by-step the guide on http://wiki.apache.org/nutch/IndexMetatags and came to the part where you check with the "nutch indexchecker URL" command for the metatag fields. Unfortunately, in the output of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc).
>> 
>> Am I missing something here?
>> 
>> Also let me know if you need more details or my nutch-site.xml config file...
>> 
>> Regards

-- Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

Re: Indexing meta tags in Nutch 1.4

Posted by Markus Jelsma <ma...@openindex.io>.
 You should see it with the parsechecker tool but not with the 
 indexchecker because you don't have an indexing filter plugin included 
 that reads and emits what's output by the parse filter. Use the 
 index-metadata plugin.

 On Thu, 3 May 2012 00:25:42 -0700 (PDT), ML mail <ml...@yahoo.com> 
 wrote:
> Dear Lewis,
>
> Thanks for the README about the parse-metatags plugin. I have now
> double checked and I have the metatags.names property in my
> nutch-site.xml config file as well as the other required properties.
> Still when running "nutch indexchecker URL" I don't see any
> description or keywords fields :( 
>
> Below I have pasted the relevant parts of my nutch-site.xml config 
> file:
>
> <property>
>         <name>index.parse.md</name>
>         <value>metatag.description,metatag.keywords</value>
> </property>
>
>
> <property>
>         <name>metatags.names</name>
>         <value>description;keywords</value>
> </property>
>
>
> <property>
>         <name>plugin.includes</name>
>         <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> As far as I know this all looks correct but maybe you can see
> something wrong? or anything else I might check?
>
> Regards
>
>
>
> ________________________________
>  From: Lewis John Mcgibbney <le...@gmail.com>
> To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> Sent: Wednesday, May 2, 2012 12:49 PM
> Subject: Re: Indexing meta tags in Nutch 1.4
>
> Hi,
>
> Please also see the README Julien kindly provided with the
> parse-metatags plugin.
>
> 
> https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup
>
> I'm hoping there should be enough info to get it working flawlessly.
> Remember, any changes you make to your config files should really be
> recompiled before moving on to a more serious deployment.
>
> On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
>> Hi Lewis,
>>
>> Thanks to your explanations, I managed to get the parse-metatags 
>> plugin built and installed into the runtime/local/plugins directory. 
>> So now I have the index-metatags from the ZIP file as well as the 
>> parse-metatags plugin from the patch installed and wanted to check if 
>> they are working. I followed step-by-step the guide 
>> on http://wiki.apache.org/nutch/IndexMetatags and came to the part 
>> where you check with the "nutch indexchecker URL" command for the 
>> metatag fields. Unfortunately, in the output of that command I don't 
>> see any keywords or description fields :( just the usual ones 
>> (site,title,content,etc).
>>
>> Am I missing something here?
>>
>> Also let me know if you need more details or my nutch-site.xml 
>> config file...
>>
>> Regards

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350

Re: Indexing meta tags in Nutch 1.4

Posted by ML mail <ml...@yahoo.com>.
Dear Lewis,

Thanks for the README about the parse-metatags plugin. I have now double checked and I have the metatags.names property in my nutch-site.xml config file as well as the other required properties. Still when running "nutch indexchecker URL" I don't see any description or keywords fields :( 

Below I have pasted the relevant parts of my nutch-site.xml config file:

<property>
        <name>index.parse.md</name>
        <value>metatag.description,metatag.keywords</value>
</property>


<property>
        <name>metatags.names</name>
        <value>description;keywords</value>
</property>


<property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

As far as I know this all looks correct but maybe you can see something wrong? or anything else I might check?

Regards



________________________________
 From: Lewis John Mcgibbney <le...@gmail.com>
To: user@nutch.apache.org; ML mail <ml...@yahoo.com> 
Sent: Wednesday, May 2, 2012 12:49 PM
Subject: Re: Indexing meta tags in Nutch 1.4
 
Hi,

Please also see the README Julien kindly provided with the
parse-metatags plugin.

https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup

I'm hoping there should be enough info to get it working flawlessly.
Remember, any changes you make to your config files should really be
recompiled before moving on to a more serious deployment.

On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
> Hi Lewis,
>
> Thanks to your explanations, I managed to get the parse-metatags plugin built and installed into the runtime/local/plugins directory. So now I have the index-metatags from the ZIP file as well as the parse-metatags plugin from the patch installed and wanted to check if they are working. I followed step-by-step the guide on http://wiki.apache.org/nutch/IndexMetatags and came to the part where you check with the "nutch indexchecker URL" command for the metatag fields. Unfortunately, in the output of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc).
>
> Am I missing something here?
>
> Also let me know if you need more details or my nutch-site.xml config file...
>
> Regards

Re: Indexing meta tags in Nutch 1.4

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Please also see the README Julien kindly provided with the
parse-metatags plugin.

https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-metatags/README.txt?view=markup

I'm hoping there should be enough info to get it working flawlessly.
Remember, after any changes to your config files you should really
rebuild before moving on to a more serious deployment.

On Tue, May 1, 2012 at 12:38 PM, ML mail <ml...@yahoo.com> wrote:
> Hi Lewis,
>
> Thanks to your explanations, I managed to get the parse-metatags plugin built and installed into the runtime/local/plugins directory. So now I have the index-metatags from the ZIP file as well as the parse-metatags plugin from the patch installed and wanted to check if they are working. I followed step-by-step the guide on http://wiki.apache.org/nutch/IndexMetatags and came to the part where you check with the "nutch indexchecker URL" command for the metatag fields. Unfortunately, in the output of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc).
>
> Am I missing something here?
>
> Also let me know if you need more details or my nutch-site.xml config file...
>
> Regards

Re: Indexing meta tags in Nutch 1.4

Posted by ML mail <ml...@yahoo.com>.
Hi Lewis,

Thanks to your explanations, I managed to get the parse-metatags plugin built and installed into the runtime/local/plugins directory. So now I have the index-metatags from the ZIP file as well as the parse-metatags plugin from the patch installed and wanted to check if they are working. I followed step-by-step the guide on http://wiki.apache.org/nutch/IndexMetatags and came to the part where you check with the "nutch indexchecker URL" command for the metatag fields. Unfortunately, in the output of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc).

Am I missing something here?

Also let me know if you need more details or my nutch-site.xml config file...

Regards



________________________________
 From: Lewis John Mcgibbney <le...@gmail.com>
To: user@nutch.apache.org; ML mail <ml...@yahoo.com> 
Cc: Julien Nioche <li...@gmail.com> 
Sent: Tuesday, May 1, 2012 12:15 PM
Subject: Re: Indexing meta tags in Nutch 1.4
 
We use Ant to build the Nutch source.

Once you've added the plugin's configuration to
/src/plugin/build.xml, added the plugin name to the
plugin.includes property in conf/nutch-site.xml, and set any
particular configuration the plugin requires, go to your
Nutch top-level directory and run 'ant runtime'.

You will then be ready to rock and roll.

hth

Lewis

On Mon, Apr 30, 2012 at 2:20 PM, ML mail <ml...@yahoo.com> wrote:
> Hi Julien,
>
> Thanks for the hint, I have downloaded the patch and applied it to my Nutch 1.4 installation. Now I see that it created the plugin source files in src/plugin/parse-metatags, and I was wondering how to compile this source into a plugin usable by Nutch. Sorry, I don't have much of a clue about Java...
>
> Regards
>
>
> ________________________________
>  From: Julien Nioche <li...@gmail.com>
> To: user@nutch.apache.org; ML mail <ml...@yahoo.com>
> Sent: Monday, April 30, 2012 2:48 PM
> Subject: Re: Indexing meta tags in Nutch 1.4
>
>
> http://wiki.apache.org/nutch/IndexMetatags refers to the code available in the trunk, which is different from the zip you downloaded. If you can't use Nutch trunk, you can use the patch https://issues.apache.org/jira/secure/attachment/12519226/NUTCH-809-trunk.patch, which corresponds to what I committed.
>
> HTH
>
> Julien
>
>
> On 30 April 2012 13:35, ML mail <ml...@yahoo.com> wrote:
>
> Hi,
>>
>>I would like to index the typical description and keywords HTML meta tags using my stable installation of Nutch 1.4. For that, I have followed the instructions from the wiki (http://wiki.apache.org/nutch/IndexMetatags) and downloaded the metatags+plugins_tutorial.zip file from the #NUTCH-809 jira issue. This ZIP file contains the index-metatags plugin but I do believe that one plugin is still missing: parse-metatags. Does anyone know where I can find that plugin? Or am I mistaken and having the index-metatags plugin is enough?
>>
>>Thanks for your feedback.
>>
>>Regards,
>>M.L.
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble



-- 
Lewis