Posted to user@nutch.apache.org by "Wilson, Matt" <Ma...@salliemae.com> on 2011/09/26 20:07:22 UTC

Indexing specific metadata tags with urlmeta

I am attempting to crawl a corporate intranet site and allow it to be searched in solr.  As part of the requirements I have to be able to index certain metadata tags as their own field in solr (for faceted search).  For example, the pages being crawled contain the following meta tag:

<meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />

I have updated the nutch-site.xml with the following:

<property>
    <name>plugin.includes</name>
    <value>urlmeta|protocol-httpclient|... </value>
</property>
<property>
    <name>urlmeta.tags</name>
    <value>keywords</value>
</property>

I have updated the solr schema.xml with the following addition:

<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>

I can see that the field has been created in Solr via the admin interface.  I also see that nutch is loading the urlmeta plugin and adding the indexfilters etc. in the hadoop.log.  The problem is that nutch does not appear to be indexing the keywords field.  All of the pages crawled have the tag present and I am receiving no errors in the nutch log.  I am unsure as to what I am missing.  This seems to be pretty straightforward; however, I must be either misunderstanding the urlmeta plugin or missing something in the configuration.

RE: Indexing specific metadata tags with urlmeta

Posted by "Wilson, Matt" <Ma...@salliemae.com>.

Lewis, 

Thank you for your reply.  I changed the capitalization in the solr/conf/schema.xml file to match that of the field in the crawled html and the other entries in nutch-site.xml.  I had already added the urlmeta.tags property.  Unfortunately I get the same results.  After a successful crawl I execute a query in solr requesting that the Keywords field be returned, and it appears to have no value.  Any ideas on how I can debug where the issue is?

Thanks, 

Matt Wilson

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com] 
Sent: Monday, September 26, 2011 3:04 PM
To: user@nutch.apache.org
Subject: Re: Indexing specific metadata tags with urlmeta

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true"
multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true"
multiValued="true"/> as per your metadata tags.

We also have a configuration option in nutch-site.xml which you could check
out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which
    allows for custom metatags to be injected alongside your crawl URLs.
    Specifying those custom tags here will allow for their propagation into
    a pages outlinks, as well as allow for them to be included as part of
    an index. Values should be comma-delimited. ("tag1,tag2,tag3") Do not
    pad the tags with white-space at their boundaries, if you are using
    anything earlier than Hadoop-0.21.
  </description>
</property>

-- 
*Lewis*



Re: Indexing specific metadata tags with urlmeta

Posted by Vijith <vi...@gmail.com>.

I'm indexing it right away while crawling (using -solr). I am using the
'crawl' command. Should I use the individual commands for inject, fetch,
etc. instead?
I clear the crawl data and the solr index before I crawl. Any clue?

On Mon, Jan 16, 2012 at 1:48 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> If this was done after you indexed your content then you will need to
> reindex all of your content to make this field searchable in your solr
> index.
>



-- 
Thanks & Regards

Vijith V

Re: Indexing specific metadata tags with urlmeta

Posted by Lewis John Mcgibbney <le...@gmail.com>.

If this was done after you indexed your content then you will need to
reindex all of your content to make this field searchable in your solr
index.

On Mon, Jan 16, 2012 at 5:31 AM, Vijith <vi...@gmail.com> wrote:

> Hi Lewis,
>
> Ya it was when I added a field like -
> <field dest="keywords" source="keywords"/>
> in the solr-mapping.xml, the "keywords" field got indexed.
>
> And I have updated the schema.xml with this -
> <field name="keywords" type="string" stored="true" indexed="true"/>
>
> I thought any field that got indexed should be searchable. Is that true.
>
> I have the following added to the html page -
> <meta name="keywords" content="plugin" />
>
> So I believe giving a query for 'plugin' should give me this page in
> results. (the page content is nothing related to plugins)
> Please correct me if I am wrong.
>
>



-- 
*Lewis*

Re: Indexing specific metadata tags with urlmeta

Posted by Vijith <vi...@gmail.com>.

Hi Lewis,

Yes, the "keywords" field only got indexed once I added a mapping like
this to solr-mapping.xml:
<field dest="keywords" source="keywords"/>

And I have updated the schema.xml with this:
<field name="keywords" type="string" stored="true" indexed="true"/>

I thought any field that got indexed should be searchable. Is that true?

I have added the following to the html page:
<meta name="keywords" content="plugin" />

So I believe a query for 'plugin' should return this page in the
results (the page content has nothing to do with plugins).
Please correct me if I am wrong.
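
One possible explanation, not confirmed anywhere in this thread: a Solr
field of type "string" is indexed as a single untokenized value, and a
query with no field prefix only looks at the schema's default search field.
Under that assumption, a query for keywords:plugin should match, while a
bare query for plugin would not unless the value is also copied into the
default field. A minimal schema.xml sketch (the "catchall" field and the
"text" field type are assumed names; adjust to whatever your schema
defines):

```xml
<!-- Sketch only: "catchall" and the "text" field type are assumptions, not from the thread -->
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="catchall" type="text" stored="false" indexed="true" multiValued="true"/>

<!-- copy both the page content and the keywords into the catch-all field -->
<copyField source="content" dest="catchall"/>
<copyField source="keywords" dest="catchall"/>

<!-- unqualified queries (q=plugin) are then run against the catch-all field -->
<defaultSearchField>catchall</defaultSearchField>
```

After a schema change like this, the content has to be reindexed before
the new field is populated.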


On Fri, Jan 13, 2012 at 6:09 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
>
> I haven't been working on this, but how does your schema configure these
> fields? Have you configured it to store and index the new metadata
> field(s)? Also you may wish to set it to some kind of custom setting via
> conf/solr-mapping.xml
>
> Only thoughts so please ignore if out of context.
>
> Lewis
>




--
Thanks & Regards

Vijith V

Re: Indexing specific metadata tags with urlmeta

Posted by Lewis John Mcgibbney <le...@gmail.com>.

I haven't been working on this, but how does your schema configure these
fields? Have you configured it to store and index the new metadata
field(s)? Also you may wish to set it to some kind of custom setting via
conf/solr-mapping.xml

Only thoughts so please ignore if out of context.

Lewis

On Fri, Jan 13, 2012 at 6:44 AM, Vijith <vi...@gmail.com> wrote:

> Tried it once again... now "keywords" field is showing up in the index
> (from Luke) but not searchable using solr..
> any thing I should do to make it searchable ??? Im using Nutch 1.4...
>
>
> --
> *Thanks & Regards*
> *
> *
> *Vijith V*
>

-- 
*Lewis*

Re: Indexing specific metadata tags with urlmeta

Posted by Vijith <vi...@gmail.com>.

I tried it once again. Now the "keywords" field is showing up in the index
(according to Luke) but is not searchable using solr.
Is there anything I should do to make it searchable? I'm using Nutch 1.4.


-- 
*Thanks & Regards*
*
*
*Vijith V*

Re: Indexing specific metadata tags with urlmeta

Posted by Vijith Kumar V <vi...@gmail.com>.

Hi all,

I am facing a few problems in using the urlmeta plugin as described in the
plugin wiki.

What I have tried so far:

- Followed the wiki page to use the urlmeta plugin with a <meta> tag in one
of my webpages, but it did not get indexed as I expected. The meta tag was
not even showing up in the readseg dumps.

- Then I tried giving the meta tags along with the urls in the seed file
(tab separated). The meta tags showed up in the dump but querying solr gave
no results.

- Then I applied the NUTCH-809 patch, rebuilt nutch, etc. to see if that
works. Same result as in the first case: not in the metadata field (readseg
dump) and not in the solr results.

- Checking the index with Luke showed NO field named "keywords" (my meta tag
name).
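
For reference, a minimal sketch of the seed-file approach from the second
bullet, assuming the injector's tab-separated key=value syntax introduced
by NUTCH-655. The URL and the value are placeholders, and <TAB> stands for
a literal tab character:

```xml
<!-- Seed file line (placeholder URL and value, fields separated by a tab):
     http://www.example.com/<TAB>keywords=plugin -->

<!-- nutch-site.xml: tell urlmeta which injected keys to propagate and index -->
<property>
  <name>urlmeta.tags</name>
  <value>keywords</value>
</property>
```

The key name in the seed file and the value of urlmeta.tags have to match;
the metadata comes from the seed file, not from the fetched HTML.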

So what could be the issue here?
Also, I want to know the difference between the urlmeta and
index-metatags plugins and their exact uses.
I am a bit confused that the urlmeta wiki talks about adding a <meta> tag to
your html page, yet it is not indexed, and that there is another plugin,
index-metatags, for the same purpose.

I am a newbie to this, please help.
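
As far as I understand (treat this as an assumption to check against your
release), the NUTCH-809 work later shipped as the parse-metatags plugin,
used together with index-metadata. In those later releases, a configuration
along these lines is what extracts an HTML <meta name="keywords"> tag,
which urlmeta does not do:

```xml
<!-- Sketch for later Nutch releases; verify the property names against your nutch-default.xml -->
<property>
  <name>metatags.names</name>
  <value>keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.keywords</value>
</property>
```

Both parse-metatags and index-metadata would also need to be listed in
plugin.includes, and the indexed field would then be named
metatag.keywords rather than keywords.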


On Thu, Jan 12, 2012 at 5:20 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Elisabeth please see my comments on issue.
>
> Thanks again
>
> Lewis
>
> On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler
> <el...@gmail.com> wrote:
>
> > Hi Dean,
> > I added my documentation and bundled plugin to jira
> > (https://issues.apache.org/jira/browse/NUTCH-809),
> > hope this helps.
> >
> >
> > On 11.01.2012 22:44, Dean Del Ponte wrote:
> >
> >> Thank-you for your response.
> >>
> >> My goal is to get Nutch to index meta tags.  It's been quite an
> >> adventure so far!
> >>
> >> On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney
> >> <lewis.mcgibbney@gmail.com> wrote:
> >>
> >>> Hi Dean,
> >>>
> >>> Unfortunately nothing official. If you look you will see that this
> >>> plugin (if eventually integrated), will combine with two other issues
> >>> which all revolve roughly around the same area.
> >>>
> >>> I have never used this patch or any of the others.
> >>>
> >>> Anyone else?
> >>>
> >>> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte
> >>> <dean.delponte@gmail.com> wrote:
> >>>
> >>>> Any documentation on how to use the patch at
> >>>> https://issues.apache.org/jira/browse/NUTCH-809 ?
> >>>>
> >>>> My apologies for the newbie question.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dean Del Ponte
> >>>>
> >>>> On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche
> >>>> <lists.digitalpebble@gmail.com> wrote:
> >>>>
> >>>>> Hi Matt,
> >>>>>
> >>>>> The plugin urlmeta does NOT extract the metadata from HTML pages. The
> >>>>> 'meta' in its name means 'crawldb metadata'
> >>>>>
> >>>>> You need to use the patch in
> >>>>> https://issues.apache.org/jira/browse/NUTCH-809
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On 26 September 2011 21:18, Wilson, Matt <Ma...@salliemae.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Also,
> >>>>>>
> >>>>>> In case this helps.  I removed the Keywords field from the solr
> >>>>>> schema to see if it would generate an error when the SolrIndexer
> >>>>>> runs and it does not.  This has led me to believe that nutch is
> >>>>>> either not indexing the meta content or it is not sending the
> >>>>>> update to solr when SolrIndexer runs.
> >>>>>>
> >>>>>> Matt Wilson
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Open Source Solutions for Text Engineering
> >>>>>
> >>>>> http://digitalpebble.blogspot.com/
> >>>>> http://www.digitalpebble.com
> >>>>>
> >>>>>
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>>
>
>
> --
> *Lewis*
>



-- 
*Thanks & Regards*
*
*
*Vijith V*

Re: Indexing specific metadata tags with urlmeta

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Elisabeth please see my comments on issue.

Thanks again

Lewis

On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler
<el...@gmail.com> wrote:

> Hi Dean,
> I added my documentation and bundled plugin to jira
> (https://issues.apache.org/jira/browse/NUTCH-809),
> hope this helps.
>
>

-- 
*Lewis*

Re: Indexing specific metadata tags with urlmeta

Posted by Elisabeth Adler <el...@gmail.com>.

I haven't tested it with 1.4 myself since we're still working with 1.3, 
but I don't think there should be any issues.

On 12.01.2012 17:38, Dean Del Ponte wrote:
> Thanks Elisabeth.  Will this patch work with Nutch 1.4?
>

Re: Indexing specific metadata tags with urlmeta

Posted by Dean Del Ponte <de...@gmail.com>.

Thanks Elisabeth.  Will this patch work with Nutch 1.4?

On Thu, Jan 12, 2012 at 3:15 AM, Elisabeth Adler
<el...@gmail.com> wrote:

> Hi Dean,
> I added my documentation and bundled plugin to jira
> (https://issues.apache.org/jira/browse/NUTCH-809), hope this helps.

Re: Indexing specific metadata tags with urlmeta

Posted by Elisabeth Adler <el...@gmail.com>.

Hi Dean,
I added my documentation and bundled plugin to jira 
(https://issues.apache.org/jira/browse/NUTCH-809), hope this helps.

On 11.01.2012 22:44, Dean Del Ponte wrote:
> Thank you for your response.
>
> My goal is to get Nutch to index meta tags.  It's been quite an adventure
> so far!

Re: Indexing specific metadata tags with urlmeta

Posted by Dean Del Ponte <de...@gmail.com>.

Thank you for your response.

My goal is to get Nutch to index meta tags.  It's been quite an adventure
so far!

On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Dean,
>
> Unfortunately nothing official. If you look you will see that this plugin
> (if eventually integrated), will combine with two other issues which all
> revolve roughly around the same area.
>
> I have never used this patch or any of the others.
>
> Anyone else?

Re: Indexing specific metadata tags with urlmeta

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Dean,

Unfortunately nothing official. If you look at the issue, you will see that
this plugin (if eventually integrated) will combine with two other issues,
all of which revolve roughly around the same area.

I have never used this patch or any of the others.

Anyone else?

On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <de...@gmail.com> wrote:

> Any documentation on how to use the patch at
> https://issues.apache.org/jira/browse/NUTCH-809?
>
> My apologies for the newbie question.
>
> Thanks,
>
> Dean Del Ponte



-- 
*Lewis*

Re: Indexing specific metadata tags with urlmeta

Posted by Dean Del Ponte <de...@gmail.com>.

Is there any documentation on how to use the patch at
https://issues.apache.org/jira/browse/NUTCH-809?

My apologies for the newbie question.

Thanks,

Dean Del Ponte

On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Matt,
>
> The plugin urlmeta does NOT extract the metadata from HTML pages. The
> 'meta' in its name means 'crawldb metadata'
>
> You need to use the patch in
> https://issues.apache.org/jira/browse/NUTCH-809
>
> HTH
>
> Julien

Re: Indexing specific metadata tags with urlmeta

Posted by Julien Nioche <li...@gmail.com>.

Hi Matt,

The urlmeta plugin does NOT extract metadata from HTML pages. The 'meta'
in its name means 'crawldb metadata'.

You need to use the patch in https://issues.apache.org/jira/browse/NUTCH-809
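
Note for readers on later Nutch 1.x releases: the work tracked in NUTCH-809 appears to have shipped eventually as the parse-metatags plugin, used together with index-metadata. The fragment below is only a sketch under that assumption. The plugin names, the metatags.names and index.parse.md properties, and the "metatag." prefix come from those later releases, not from the patch attached to the issue, so check them against the nutch-default.xml of your version. The plugin list shown is illustrative, not a recommendation.

```xml
<!-- nutch-site.xml (sketch; assumes a Nutch 1.x release that bundles
     parse-metatags and index-metadata) -->
<property>
  <name>plugin.includes</name>
  <!-- Hypothetical plugin list: keep whatever else your crawl already uses
       and add parse-metatags and index-metadata to it. -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|metatags)|index-(basic|metadata)|scoring-opic</value>
</property>
<property>
  <name>metatags.names</name>
  <!-- Names of the HTML meta tags to extract at parse time. -->
  <value>keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- Parse-metadata keys to copy into the index; extracted meta tags are
       assumed to carry the "metatag." prefix. -->
  <value>metatag.keywords</value>
</property>

<!-- schema.xml: a matching Solr field (name assumed to follow the key above) -->
<field name="metatag.keywords" type="string" stored="true" indexed="true" multiValued="true"/>
```

With a setup along these lines the value would be read from the page's own meta element at parse time, which is the step urlmeta never performs.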

HTH

Julien


On 26 September 2011 21:18, Wilson, Matt <Ma...@salliemae.com> wrote:

> Also,
>
> In case this helps.  I removed the Keywords field from the solr schema to
> see if it would generate an error when the SolrIndexer runs and it does not.
>  This has led me to believe that nutch is either not indexing the meta
> content or it is not sending the update to solr when SolrIndexer runs.
>
> Matt Wilson



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: Indexing specific metadata tags with urlmeta

Posted by "Wilson, Matt" <Ma...@salliemae.com>.

Also, in case this helps: I removed the Keywords field from the solr schema to see if it would generate an error when the SolrIndexer runs, and it does not.  This has led me to believe that nutch is either not indexing the meta content or not sending the update to solr when SolrIndexer runs.

Matt Wilson

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com] 
Sent: Monday, September 26, 2011 3:04 PM
To: user@nutch.apache.org
Subject: Re: Indexing specific metadata tags with urlmeta

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true"
multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true"
multiValued="true"/> as per your metadata tags.

We also have a configuration option in nutch-site.xml which you could check
out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which
allows
    for custom metatags to be injected alongside your crawl URLs. Specifying
those
    custom tags here will allow for their propagation into a pages outlinks,
as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags
with
    white-space at their boundaries, if you are using anything earlier than
Hadoop-0.21.
  </description>
</property>

On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
<Ma...@salliemae.com>wrote:

> I am attempting to crawl a corporate intranet site and allow it to be
> searched in solr.  As part of the requirements I have to be able to index
> certain metadata tags as their own field in solr (for faceted search).  For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
>    <name>plugin.includes</name>
>    <value>urlmeta|protocol-httpclient|... </value>
> </property>
> <property>
>    <name>urlmeta.tags</name>
>    <value>keywords</value>
> </property>
>
> I have updated the solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
> I also see that nutch is loading the urlmeta plugin and adding the
> indexfilters etc in the hadoop.log.  The problem is that nutch does not
> appear to be indexing the keywords field.  All of the pages crawled have the
> tag present and I am receiving no errors in the nutch log.  I am unsure as
> to what I am missing.  This seems to be pretty straightforward; however, I
> must be misunderstanding either the urlmeta plugin or missing something in
> the configuration.
>



-- 
*Lewis*



Re: Indexing specific metadata tags with urlmeta

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>

as per your metadata tags.

We also have a configuration option in nutch-site.xml which you could check out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a pages outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>
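
For what it is worth, here is a minimal sketch of how the pieces would line up, going by the description above: urlmeta works on tags that were injected alongside the seed URLs, so the name listed in urlmeta.tags, the key used in the seed list, and the Solr field name would all need the same spelling. The seed URL and value in the comment are hypothetical, and treating the name match as case-sensitive is an assumption.

```xml
<!-- nutch-site.xml: list the tag exactly as it is injected with each seed URL.
     A seed-list line would look something like (tab-separated, hypothetical values):
       http://intranet.example.com/    Keywords=Banking
     Assumption: the tag name is matched case-sensitively. -->
<property>
  <name>urlmeta.tags</name>
  <value>Keywords</value>
</property>

<!-- Solr schema.xml: the field name uses the same spelling and case as the tag. -->
<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>
```

If the tag exists only in the page HTML and was never injected with the seed URLs, the description suggests urlmeta would have nothing to copy into the index, which would be consistent with no error being reported.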

On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <Ma...@salliemae.com> wrote:

> I am attempting to crawl a corporate intranet site and allow it to be
> searched in solr.  As part of the requirements I have to be able to index
> certain metadata tags as their own field in solr (for faceted search).  For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
>    <name>plugin.includes</name>
>    <value>urlmeta|protocol-httpclient|... </value>
> </property>
> <property>
>    <name>urlmeta.tags</name>
>    <value>keywords</value>
> </property>
>
> I have updated the solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
> I also see that nutch is loading the urlmeta plugin and adding the
> indexfilters etc in the hadoop.log.  The problem is that nutch does not
> appear to be indexing the keywords field.  All of the pages crawled have the
> tag present and I am receiving no errors in the nutch log.  I am unsure as
> to what I am missing.  This seems to be pretty straightforward; however, I
> must be misunderstanding either the urlmeta plugin or missing something in
> the configuration.
>



-- 
*Lewis*