Posted to user@nutch.apache.org by "John R. Brinkema" <br...@teo.uscourts.gov> on 2011/08/24 22:36:26 UTC

Trying to understand and use URLmeta

Hi all,

I am trying to use URLmeta to inject metadata into documents that I crawl
and I am having some problems.

First the context:  Nutch 1.3 with Solr 3.2

My seed URL file looks like:
http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"

I put JBmarker there so I could see where the metadata got put.

Index.html itself is a table of contents of a guide; that is, it is 
mostly a list of outlinks to parts of the overall guide.

My nutch-site.xml includes the following properties:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlmeta.tags</name>
<value>recommended,keywords</value>
</property>

I fire up nutch to crawl and all goes well.   To see what nutch did, I 
ran 'readseg -dump' and looked at the results.  What I found was the 
following:

... other Recno's above ...

Recno:: 56
URL:: http:/mySite.com/Guide/index.html

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Aug 23 10:08:18 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 5c182af41027766eccf1ea60d112772c
Metadata:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 10:08:04 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: recommended: Guide_ngt_: 1314108489210keywords: 
"Guide,Policy,JBmarker"

Content::
Version: -1
url: http://mySite.com/Guide/index.html
base: http://mySite.com/Guide/index.html
... lots more content ...

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Aug 23 10:08:15 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: recommended: Guide_ngt_: 1314108489210keywords: 
"Guide,Policy,JBmarker"_pst_: success(1), lastModified=0

ParseData::
Version: 5
Status: success(1,0)
Title: Guide
Outlinks: 60
   outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
   outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html 
anchor: Contact Me
... many more outlinks ...
Content Metadata: nutch.content.digest=5c182af41027766eccf1ea60d112772c 
Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT 
Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT 
nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110823100811 
Content-Type=text/html Connection=close Server=Netscape-Enterprise/6.0
Parse Metadata: CharEncodingForConversion=windows-1252 
OriginalCharEncoding=windows-1252

ParseText::
... lots of parsed text ...

Recno::  57

... and so forth.

JBmarker does not appear anywhere else, in this segment or any of the 
others.

When I do a solrindex, JBmarker does not appear to be anywhere.  ??

*What I expected*

As I understand URLmeta (as defined by the two Nutch patches), the
metadata that is included with the URL is injected into the seed URL; that
is to say, it is as if the lines:

<META NAME="recommended" CONTENT="Guide">
<META NAME="keywords" CONTENT="Guide,Policy,JBmarker">

were in the seed url content.  Furthermore,  it is as if those two lines 
were in all the outlink content of the seed url.  So, I expected that 
when I looked at all the CrawlDatum and ParseData of the outlinks from 
the seed url, I would see the same meta data as in the seed CrawlDatum 
and ParseData.  Which is clearly not the case.

As for solrindex, I assume that I have some work to do to get any 
special metadata actions moved over to solr; a special plugin of some 
sort.  That is, urlmeta does not help get the collected metadata from 
Nutch to Solr.

So what is happening?  Where did I go astray?  Am I analyzing the Nutch 
dumps incorrectly?

One other side note:  I assume that Luke will no longer help me debug
Nutch, since it works with Lucene indexes and Nutch no longer creates such
beasts.  Are there any tools that help with viewing Nutch databases?  It
seems that Nutch takes some liberties with the data it is dumping (e.g.,
the meta tags are all concatenated together without delimiters; I assume
that internally, the meta tags are separated somehow).

Thanks, as always.

Re: Trying to understand and use URLmeta

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

It's a fine tool indeed. In my experience (not opinion) 1.4-dev is as stable
as 1.3, with quality bug fixes and improvements. We use it (always the latest
revision) in production and I would certainly vote +1 if we were to release
the current 1.4-dev as a new stable release.

If you're hesitant, which is a good quality, you can always use the tool in a
local/development environment. There are no invasive changes in how plugins
work, so using the tool in a dev environment would help you on your way.

Cheers

> Markus,
> 
> Yes, I drooled over indexchecker enough that I briefly considered trying
> the development release, but I (for now) need to focus on a production
> quality product.
> 
> In the meantime, LOG.info's scattered about the code will suffice for my
> needs.
> 
> On 8/29/2011 7:00 PM, Markus Jelsma wrote:
> > In the current Nutch 1.4-dev you can check the output of the indexer by
> > using the indexchecker command. It takes a URL and displays the
> > values of the fields it is going to add.
> > 
> >> Lewis,
> >> 
> >> After shaking off the annoyance of your "RTFM Luke" answer (I had read
> >> the tutorial several times), I listened to your suggestion (I do respect
> >> my elders ... especially my 'application-elders') and I spent the
> >> weekend reading code,  scanning the javadocs files and adding logging
> >> statements.  Considering how poorly Nutch is documented (sparse, and
> >> what is documented probably refers to an old version), it was
> >> challenging, but worthwhile.  What I found:
> >> 
> >> That I deserve a kick in the head since I was only looking in the Nutch
> >> databases for the results of urlmeta. The Nutch databases, of course,
> >> no longer contain indexing information; the name URLMetaIndexingFilter
> >> (indexing !!!) should have told me.
> >> 
> >> That still did not help when I looked in the Solr Index.  After a lot of
> >> analysis and some logging statements later, I discovered that 'urlmeta'
> >> was not being loaded.  The plugin.includes statement in the tutorial is
> >> incorrect.  It is (fragment)
> >> 
> >> ...|index-(basic|anchor|'''urlmeta''')| ...
> >> 
> >> 
> >> and should be
> >> 
> >> ...|index-(basic|anchor)| urlmeta | ...
> >> 
> >> The name of the plugin is 'urlmeta' not index-urlmeta.
> >> 
> >> Once I got urlmeta loaded, the indexing almost ran correctly.  I got a
> >> Solr error complaining that a field was undefined ... the metadata
> >> fields that I was injecting.  I solved that problem by adding the two
> >> fields I was injecting to the Solr schema.xml.  With that, the indexing
> >> completed with no errors.
> >> 
> >> I now (I think) understand how urlmeta works.  I do have two questions,
> >> however.
> >> 
> >> 1) Now that Solr is the official indexer for Nutch, are we still
> >> supposed to copy the Nutch schema over to Solr?  The Solr schema has
> >> gotten very complicated recently and I am concerned about losing some
> >> Solr functionality.
> >> 
> >> 2) What is the role of solrindex-mapping.xml? I only added my field
> >> names to the Solr schema.xml; I made no changes to the Nutch schema.xml
> >> nor made any changes to solrindex-mapping.xml.
> >> 
> >> All in all, an interesting and educational weekend.
> >> 
> >> /jb
> >> 
> >> On 8/25/2011 5:11 AM, lewis john mcgibbney wrote:
> >>> Hi JB,
> >>> 
> >>> We have recently finished a complete plugin tutorial which fully
> >>> explains the functionality of the urlmeta plugin on the wiki. It can
> >>> be found here [1], could I ask you to have a thorough look at it, and
> >>> the code and if you still have questions then please reinforce them.
> >>> 
> >>> [1] http://wiki.apache.org/nutch/WritingPluginExample
> >>> 
> >>> Thank you
> >>> 

Re: Trying to understand and use URLmeta

Posted by "John R. Brinkema" <br...@teo.uscourts.gov>.

Markus,

Yes, I drooled over indexchecker enough that I briefly considered trying 
the development release, but I (for now) need to focus on a production 
quality product.

In the meantime, LOG.info's scattered about the code will suffice for my 
needs.


Re: Trying to understand and use URLmeta

Posted by Markus Jelsma <ma...@openindex.io>.

In the current Nutch 1.4-dev you can check the output of the indexer by using
the indexchecker command. It takes a URL and displays the values of the
fields it is going to add.
 

Re: Trying to understand and use URLmeta

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi John,


On Mon, Aug 29, 2011 at 11:27 PM, John R. Brinkema <
brinkema@teo.uscourts.gov> wrote:

> Lewis,
>
> After shaking off the annoyance of your "RTFM Luke" answer (I had read the
> tutorial several times),


If this was how you interpreted my comments then I am sorry. It was not how
they were meant to be.

> I listened to your suggestion (I do respect my elders ... especially my
> 'application-elders') and I spent the weekend reading code,  scanning the
> javadocs files and adding logging statements.  Considering how poorly Nutch
> is documented (sparse, and what is documented probably refers to an old
> version), it was challenging, but worthwhile.


You are correct in some areas; however, can you be more precise and give
examples of what kind of documentation you would appreciate? I have been
gradually working towards the goal of getting some comprehensive
documentation for Nutch rolled out across the board, but it is a reasonably
large task. If you would like to help out then please see NUTCH-881.

> What I found:
>
> That I deserve a kick in the head since I was only looking in the Nutch
> databases for the results of urlmeta.  The Nutch databases, of course, no
> longer contain indexing information; the name URLMetaIndexingFilter
> (indexing !!!) should have told me.
>
> That still did not help when I looked in the Solr Index.  After a lot of
> analysis and some logging statements later, I discovered that 'urlmeta' was
> not being loaded.  The plugin.includes statement in the tutorial is
> incorrect.  It is (fragment)
>
> ...|index-(basic|anchor|'''urlmeta''')| ...
>
>
> and should be
>
> ...|index-(basic|anchor)| urlmeta | ...
>

Thanks, this was the reason that I posted a quick message to user@, in case I
had missed or wrongly added any information on the writing plugin page.
Thanks for pointing this out; I will get it changed ASAP.


> The name of the plugin is 'urlmeta' not index-urlmeta.
>
> Once I got urlmeta loaded, the indexing almost ran correctly.  I got a Solr
> error complaining that a field was undefined ... the metadata fields that I
> was injecting.  I solved that problem by adding the two fields I was
> injecting to the Solr schema.xml.  With that, the indexing completed with no
> errors.
>
> I now (I think) understand how urlmeta works.  I do have two questions,
> however.
>
> 1) Now that Solr is the official indexer for Nutch, are we still supposed
> to copy the Nutch schema over to Solr?  The Solr schema has gotten very
> complicated recently and I am concerned about losing some Solr
> functionality.
>

Correct, there are also various JIRA issues related to this long-standing
area, as we intend to make it an easier and less confusing integration. What
I would encourage you to do (if you have time) is strip your de facto Solr
schema back to the bare bones, then add functionality on top of the Nutch
schema as you see fit. This way you can add value to your Solr core whilst
ensuring all Nutch data remains intact.

>
> 2) What is the role of solrindex-mapping.xml ? I only added my field names
> to the Solr schema.xml; I made no changes to the Nutch schema.xml nor made
> any changes to solrindex-mapping.xml.
>

The Nutch schema lets us specify the types of the fields we wish to index
from the documents we crawl.
solrindex-mapping.xml provides a resource for mapping (renaming) those
fields from source to target, e.g. from Nutch to the Solr indexer. In short,
it gives us flexibility when building an index, which can then be easily
traversed and easily queried for precise results.
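
As a rough illustration of the source-to-target idea, a solrindex-mapping.xml
fragment might look like the sketch below. The element layout follows the
file shipped in Nutch's conf directory as far as I recall it; the two urlmeta
entries are hypothetical and probably optional, since fields without an
explicit entry appear to be sent to Solr under their own names (which would
explain why indexing worked without touching this file).

```xml
<mapping>
  <fields>
    <!-- source = field name produced by Nutch, dest = field name in the Solr schema -->
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <!-- hypothetical entries for the injected urlmeta tags -->
    <field dest="recommended" source="recommended"/>
    <field dest="keywords" source="keywords"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```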


>
> All in all, an interesting and educational weekend.
>

excellent


>
> /jb
>
>
>


-- 
*Lewis*

Re: Trying to understand and use URLmeta

Posted by "John R. Brinkema" <br...@teo.uscourts.gov>.

Lewis,

After shaking off the annoyance of your "RTFM Luke" answer (I had read 
the tutorial several times), I listened to your suggestion (I do respect 
my elders ... especially my 'application-elders') and I spent the 
weekend reading code,  scanning the javadocs files and adding logging 
statements.  Considering how poorly Nutch is documented (sparse, and 
what is documented probably refers to an old version), it was 
challenging, but worthwhile.  What I found:

That I deserve a kick in the head since I was only looking in the Nutch 
databases for the results of urlmeta.  The Nutch databases, of course, 
no longer contain indexing information; the name URLMetaIndexingFilter 
(indexing !!!) should have told me.

That still did not help when I looked in the Solr Index.  After a lot of 
analysis and some logging statements later, I discovered that 'urlmeta' 
was not being loaded.  The plugin.includes statement in the tutorial is 
incorrect.  It is (fragment)

...|index-(basic|anchor|'''urlmeta''')| ...


and should be

...|index-(basic|anchor)| urlmeta | ...

The name of the plugin is 'urlmeta' not index-urlmeta.
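
For reference, the corrected property would look roughly like the sketch
below, assuming the rest of the plugin list from the original nutch-site.xml
is unchanged and only urlmeta moves out of the index-(...) group:

```xml
<property>
  <name>plugin.includes</name>
  <!-- urlmeta is listed under its own name, not as index-urlmeta -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The value is a regular expression, so the spaces shown around urlmeta in the
fragment above are only there for readability and should not be typed into
the real value.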

Once I got urlmeta loaded, the indexing almost ran correctly.  I got a 
Solr error complaining that a field was undefined ... the metadata 
fields that I was injecting.  I solved that problem by adding the two 
fields I was injecting to the Solr schema.xml.  With that, the indexing 
completed with no errors.
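
The two additions to the Solr schema.xml might look like the sketch below.
The field names come from urlmeta.tags; the type and the indexed/stored
attributes are assumptions chosen for illustration (the post does not say
which were used), and "string" is assumed to be defined as a fieldType in the
schema, as it is in the stock Solr example schema.

```xml
<!-- inside the <fields> section of Solr's schema.xml -->
<field name="recommended" type="string" indexed="true" stored="true"/>
<field name="keywords" type="string" indexed="true" stored="true"/>
```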

I now (I think) understand how urlmeta works.  I do have two questions, 
however.

1) Now that Solr is the official indexer for Nutch, are we still 
supposed to copy the Nutch schema over to Solr?  The Solr schema has 
gotten very complicated recently and I am concerned about losing some 
Solr functionality.

2) What is the role of solrindex-mapping.xml ? I only added my field 
names to the Solr schema.xml; I made no changes to the Nutch schema.xml 
nor made any changes to solrindex-mapping.xml.

All in all, an interesting and educational weekend.

/jb



Re: Trying to understand and use URLmeta

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi JB,

We have recently finished a complete plugin tutorial on the wiki which fully
explains the functionality of the urlmeta plugin. It can be found here [1].
Could I ask you to have a thorough look at it and at the code, and if you
still have questions then please post them again.

[1] http://wiki.apache.org/nutch/WritingPluginExample

Thank you


-- 
*Lewis*