Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/06 03:32:00 UTC

Re: Tagging records by seed list

Hi Sebastian,

I tried using the urlmeta plugin but my indexed records don't have the
field I expected.

Here's what I did:

1. I dropped the nutch core in Solr.
2. I recursively removed the files in crawldb, linkdb, and segments.
3. I edited seed.txt to have a tab after the url and then source=source1
4. I edited nutch-site.xml and changed index-(basic|anchor) to be
index-(basic|anchor|urlmeta)
5. I set the value of urlmeta.tags to be this: <value>source</value>
6. I went through the tutorial and loaded some data into Solr.
7. I queried that nutch core in the Solr UI. I see records but no "source"
field.

What am I missing?

Thanks.

Sol

On Wed, Oct 25, 2017 at 3:08 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi Sol,
>
> yes, that's the right way to go:
>  1. add metadata to the seed list
>      url \t key=val
>  2. use the urlmeta plugin (links below) to
>   a) pass metadata forward from seeds to linked pages
>   b) and index it
>
> Or did you mean another plugin?
>
> Best,
> Sebastian
>
>
> https://issues.apache.org/jira/browse/NUTCH-655
>
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
>
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
>
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
>
> (Please open a Jira issue to fix the Javadoc; the formatting has been lost.
> Thanks!)
>
> On 10/25/2017 08:03 PM, Sol Lederman wrote:
> > Hi,
> >
> > I've got a requirement to crawl three different sets of seed lists. I'd
> > like to put the crawl results documents into a single Solr index BUT I need
> > to tag the records with which seed list they came from. Using facets is one
> > way. Having a field that identifies the seed list is another way. I've seen
> > a little bit of documentation that mentions using the metadata plugin for
> > this purpose. Is this a good approach for this requirement?
> >
> > Thanks.
> >
> > Sol
> >
>
>

Re: Tagging records by seed list

Posted by Sol Lederman <so...@gmail.com>.

Thanks, Sebastian.

Ah, I got the wrong information from an old API Javadoc page.

I fixed the plugin name, updated the nutch and Solr schemas to have my new
field, dropped and re-added the nutch core, recrawled, reindexed, and the
new field is there with the correct value when I query!

Thanks again for your help!

Sol
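
The schema change mentioned above might look roughly like the following.
This is only a sketch: the field name "source" matches the urlmeta.tags
value used in this thread, while the "string" type and the stored/indexed
attributes are assumptions that depend on the schema actually in use.

```xml
<!-- Sketch only: place next to the other <field> entries in the schema. -->
<!-- The name must match the tag listed in urlmeta.tags ("source" in this thread). -->
<!-- type="string" is an assumption; use whichever untokenized type your schema defines. -->
<field name="source" type="string" stored="true" indexed="true"/>
```

An untokenized field is assumed here so that the value can be used directly
for filtering or faceting by seed list.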



On Mon, Nov 6, 2017 at 1:45 AM, Sebastian Nagel <wa...@googlemail.com>
wrote:

> Hi Sol,
>
> > 4. I edited nutch-site.xml and changed index-(basic|anchor) to be
> > index-(basic|anchor|urlmeta)
>
> The name of the plugin is "urlmeta" (not "index-urlmeta").
> It implements two plugin extension points: an indexing filter and
> a scoring filter, which makes sure the metadata is transferred to
> the linked pages.
>
> Sebastian
>

Re: Tagging records by seed list

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Sol,

> 4. I edited nutch-site.xml and changed index-(basic|anchor) to be
> index-(basic|anchor|urlmeta)

The name of the plugin is "urlmeta" (not "index-urlmeta").
It implements two plugin extension points: an indexing filter and
a scoring filter, which makes sure the metadata is transferred to
the linked pages.

Sebastian
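
For reference, a minimal sketch of what the corrected nutch-site.xml
settings might look like. The plugin name "urlmeta", the tag "source" and
the urlmeta.tags property come from this thread; the other plugin names in
plugin.includes are assumed from a typical Nutch 1.x setup and will differ
between installations, so only the added "|urlmeta" alternative matters.

```xml
<!-- Sketch only: merge these properties into the existing nutch-site.xml. -->
<configuration>
  <!-- Seed line format from the thread: url [TAB] source=source1 -->
  <property>
    <name>plugin.includes</name>
    <!-- "urlmeta" is its own alternative, outside the index-(...) group,
         so the pattern matches "urlmeta" rather than "index-urlmeta".
         The remaining plugin names are assumptions; keep what you already have. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>urlmeta.tags</name>
    <!-- Metadata key(s) from the seed list to propagate and index. -->
    <value>source</value>
  </property>
</configuration>
```

Because plugin.includes is a regular expression over plugin names, writing
index-(basic|anchor|urlmeta) would only match a plugin called
"index-urlmeta", which is why the field never showed up in the index.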
