Posted to user@nutch.apache.org by Sami Siren <ss...@gmail.com> on 2006/07/26 13:47:18 UTC

[Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

redirecting to nutch-user...

> What I currently have is that max. 2 matches are shown per website - but
> that also from the summary-website only 2 matches are shown. Either I'd
> need to be able to show only 2 matches per website but _all_ matches
> from the summary-website (would be okay in this case) or give website 1
> to 4 individual "IDs per website" and also assign each URL from the
> summary-website the corresponding ID of the website it belongs to.

You can add whatever (meta-)data you like to the index with an indexing
filter. You could, for example, assign tag "A" to site A, tag "B" to site B,
etc., and then assign unique tags to the pages from the summary site.

In the searching phase you then use that new field as the dedup field
(instead of site).

This should give you a maximum of (for example) 2 hits per website and
unlimited hits from the summary website.

Does that fulfill your requirements?
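
The effect of deduplicating on such a tag field can be sketched in plain
Java as follows. This is only an illustration under stated assumptions, not
Nutch's search code: TaggedHit, DedupSketch and limitPerTag are hypothetical
names, and the hits are assumed to arrive already ranked. Pages from the
summary site carry a unique tag each (here simply their own URL), so the
per-tag limit never removes them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a search hit; not a Nutch class.
final class TaggedHit {
    final String url;
    final String tag; // value of the custom field used for deduplication

    TaggedHit(String url, String tag) {
        this.url = url;
        this.tag = tag;
    }
}

final class DedupSketch {
    // Keeps at most maxPerTag hits for each tag value, preserving rank order.
    static List<TaggedHit> limitPerTag(List<TaggedHit> rankedHits, int maxPerTag) {
        Map<String, Integer> seenPerTag = new HashMap<>();
        List<TaggedHit> kept = new ArrayList<>();
        for (TaggedHit hit : rankedHits) {
            // merge() returns the updated count for this tag.
            int count = seenPerTag.merge(hit.tag, 1, Integer::sum);
            if (count <= maxPerTag) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

With three hits tagged "A" and three summary-site hits tagged with their own
URLs, a limit of 2 keeps two hits for site A and all three summary hits.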


--
 Sami Siren

Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Andrzej Bialecki wrote:
> Stefan Neufeind wrote:
>> Sami Siren wrote:
>>  
>>> Stefan Neufeind wrote:
>>>    
>>>> Sami Siren wrote:
>>>>      
>>>>> redirecting to nutch-user...
>>>>>        
>>>>>> What I currently have is that max. 2 matches are shown per website -
>>>>>> but
>>>>>> that also from the summary-website only 2 matches are shown.
>>>>>> Either I'd
>>>>>> need to be able to show only 2 matches per website but _all_ matches
>>>>>> from the summary-website (would be okay in this case) or give
>>>>>> website 1
>>>>>> to 4 individual "IDs per website" and also assign each URL from the
>>>>>> summary-website the corresponding ID of the website it belongs to.
>>>>>>           
>>>>> You can add whatever (meta-)data to index with indexing filter. You
>>>>> could
>>>>> for example assign tag "A" to site A, tag "B" to B etc...
>>>>> then assign unique tags for pages from summary site.
>>>>>
>>>>> In searching phase you then use that new field as dedupfield
>>>>> (instead of
>>>>> site)
>>>>>
>>>>> This should give you max (for example 2) hits per website and
>>>>> unlimited
>>>>> hits
>>>>> from summary web site.
>>>>>
>>>>> Does that fulfill your requirements?
>>>>>         
>>>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
>>>> what "filter"?
>>>>       
>>> Write a plugin that provides implementation of
>>> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html
>>
>> That was (part of) my question - how to do that "cleanly", and if
>> somebody could give a hint. I'm not sure what would be the elegant way
>> of having a "match URL against ... and set tags ABC"-patternfile, how to
>> use a hash-map or something for that and how to do it in Java. (Sorry,
>> I'm not that familiar with Java as with other languages, and neither
>> with nutch-internals).
> 
> If it's a relatively short list of urls (let's say less than 50,000
> entries) then you can use org.apache.nutch.util.PrefixStringMatcher,
> which builds a compact trie structure. I would then strongly advise you
> to keep just the urls (or whatever it is that you need to match) in that
> structure, and all other data in an external DB or a special-purpose
> Lucene index. You can implement this as an indexing plugin - if the
> pattern matches, then you get additional metadata from some external
> source, and you add additional fields to the index that contain this data.

Hmm, I'm still not sure how this would work. (Sorry for that!) I know
that every URL in my index matches some prefix. I just need to find out
which one. E.g.

	http://www.example.com/test1/    as prefix
and
	http://www.example.com/test1/page1.htm   as the page-URL

Now I would want to do a lookup and assign, based on the prefix, the ID
"test1". Am I right to conclude that in this case I could leave out the
PrefixStringMatcher, since I know that some string will match for all
the URLs?

Do you perhaps have a small example of a plugin that matches against an
external database?
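
The lookup described above (page URL in, ID of the matching prefix out) can
be sketched in plain Java like this. It is a minimal sketch under stated
assumptions: PrefixIdLookup, add and lookup are hypothetical names, and it
uses a simple linear scan with String.startsWith rather than the trie-based
PrefixStringMatcher mentioned in the thread. If several prefixes match, the
longest one wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: maps URL prefixes to IDs.
final class PrefixIdLookup {
    private final Map<String, String> idByPrefix = new LinkedHashMap<>();

    void add(String prefix, String id) {
        idByPrefix.put(prefix, id);
    }

    // Returns the ID of the longest registered prefix of url,
    // or null if no registered prefix matches.
    String lookup(String url) {
        String bestPrefix = null;
        for (String prefix : idByPrefix.keySet()) {
            boolean longer = bestPrefix == null || prefix.length() > bestPrefix.length();
            if (url.startsWith(prefix) && longer) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? null : idByPrefix.get(bestPrefix);
    }
}
```

After add("http://www.example.com/test1/", "test1"), a lookup of
http://www.example.com/test1/page1.htm returns "test1". The scan touches
every prefix on each lookup, which is fine for a handful of sites; for a
long list a trie would scale better.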

PS: Your help is very much appreciated. Sorry for asking dumb questions :-)


Regards,
 Stefan

Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

Posted by Andrzej Bialecki <ab...@getopt.org>.

Stefan Neufeind wrote:
> Sami Siren wrote:
>   
>> Stefan Neufeind wrote:
>>     
>>> Sami Siren wrote:
>>>
>>>       
>>>> redirecting to nutch-user...
>>>>
>>>>
>>>>         
>>>>> What I currently have is that max. 2 matches are shown per website -
>>>>> but
>>>>> that also from the summary-website only 2 matches are shown. Either I'd
>>>>> need to be able to show only 2 matches per website but _all_ matches
>>>>> from the summary-website (would be okay in this case) or give website 1
>>>>> to 4 individual "IDs per website" and also assign each URL from the
>>>>> summary-website the corresponding ID of the website it belongs to.
>>>>>           
>>>> You can add whatever (meta-)data to index with indexing filter. You
>>>> could
>>>> for example assign tag "A" to site A, tag "B" to B etc...
>>>> then assign unique tags for pages from summary site.
>>>>
>>>> In searching phase you then use that new field as dedupfield (instead of
>>>> site)
>>>>
>>>> This should give you max (for example 2) hits per website and unlimited
>>>> hits
>>>> from summary web site.
>>>>
>>>> Does that fulfill your requirements?
>>>>         
>>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
>>> what "filter"?
>>>
>>>       
>> Write a plugin that provides implementation of
>> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html
>>     
>
> That was (part of) my question - how to do that "cleanly", and if
> somebody could give a hint. I'm not sure what would be the elegant way
> of having a "match URL against ... and set tags ABC"-patternfile, how to
> use a hash-map or something for that and how to do it in Java. (Sorry,
> I'm not that familiar with Java as with other languages, and neither
> with nutch-internals).
>   

If it's a relatively short list of URLs (let's say fewer than 50,000
entries) then you can use org.apache.nutch.util.PrefixStringMatcher,
which builds a compact trie structure. I would then strongly advise you
to keep just the URLs (or whatever it is that you need to match) in that
structure, and all other data in an external DB or a special-purpose
Lucene index. You can implement this as an indexing plugin: if the
pattern matches, you get additional metadata from some external
source, and you add fields containing this data to the index.

I have implemented several plugins that use this trick, and it works very
well.
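
The shape of that pattern can be sketched in plain Java as below. This is a
sketch under stated assumptions, not the Nutch plugin API: the class
EnrichingFilterSketch, its filter method and the field name "sectionId" are
hypothetical, the document is modelled as a simple Map of field names to
values, and the external DB is modelled as a Function from the matched
prefix to its metadata. Only the URL prefixes are held by the matcher; all
other data comes from the external source.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; a real plugin would implement Nutch's IndexingFilter.
final class EnrichingFilterSketch {
    private final List<String> prefixes;
    // Stand-in for an external DB or special-purpose index, keyed by prefix.
    private final Function<String, Map<String, String>> externalSource;

    EnrichingFilterSketch(List<String> prefixes,
                          Function<String, Map<String, String>> externalSource) {
        this.prefixes = prefixes;
        this.externalSource = externalSource;
    }

    // Adds metadata fields to doc when url starts with a known prefix;
    // leaves doc unchanged when nothing matches.
    Map<String, String> filter(Map<String, String> doc, String url) {
        String match = null;
        for (String prefix : prefixes) {
            if (url.startsWith(prefix) && (match == null || prefix.length() > match.length())) {
                match = prefix;
            }
        }
        if (match != null) {
            Map<String, String> extra = externalSource.apply(match);
            if (extra != null) {
                doc.putAll(extra);
            }
        }
        return doc;
    }
}
```

Backing externalSource with a HashMap (for example db::get) is enough to try
this out; swapping in a database query later does not change the filter.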

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Sami Siren wrote:
> Stefan Neufeind wrote:
>> Sami Siren wrote:
>>
>>> redirecting to nutch-user...
>>>
>>>
>>>> What I currently have is that max. 2 matches are shown per website -
>>>> but
>>>> that also from the summary-website only 2 matches are shown. Either I'd
>>>> need to be able to show only 2 matches per website but _all_ matches
>>>> from the summary-website (would be okay in this case) or give website 1
>>>> to 4 individual "IDs per website" and also assign each URL from the
>>>> summary-website the corresponding ID of the website it belongs to.
>>>
>>> You can add whatever (meta-)data to index with indexing filter. You
>>> could
>>> for example assign tag "A" to site A, tag "B" to B etc...
>>> then assign unique tags for pages from summary site.
>>>
>>> In searching phase you then use that new field as dedupfield (instead of
>>> site)
>>>
>>> This should give you max (for example 2) hits per website and unlimited
>>> hits
>>> from summary web site.
>>>
>>> Does that fulfill your requirements?
>>
>>
>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
>> what "filter"?
>>
> 
> Write a plugin that provides implementation of
> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html

That was (part of) my question: how to do that "cleanly", and whether
somebody could give a hint. I'm not sure what the elegant way would be
to have a pattern file of the form "match URL against ... and set tags
ABC", how to use a hash map or something similar for that, and how to do
it in Java. (Sorry, I'm not as familiar with Java as with other
languages, nor with Nutch internals.)

  Stefan

Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

Posted by Sami Siren <ss...@gmail.com>.

Stefan Neufeind wrote:
> Sami Siren wrote:
> 
>>redirecting to nutch-user...
>>
>>
>>>What I currently have is that max. 2 matches are shown per website - but
>>>that also from the summary-website only 2 matches are shown. Either I'd
>>>need to be able to show only 2 matches per website but _all_ matches
>>>from the summary-website (would be okay in this case) or give website 1
>>>to 4 individual "IDs per website" and also assign each URL from the
>>>summary-website the corresponding ID of the website it belongs to.
>>
>>You can add whatever (meta-)data to index with indexing filter. You could
>>for example assign tag "A" to site A, tag "B" to B etc...
>>then assign unique tags for pages from summary site.
>>
>>In searching phase you then use that new field as dedupfield (instead of
>>site)
>>
>>This should give you max (for example 2) hits per website and unlimited
>>hits
>>from summary web site.
>>
>>Does that fulfill your requirements?
> 
> 
> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
> what "filter"?
> 

Write a plugin that provides an implementation of
http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html


--
  Sami Siren

Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Sami Siren wrote:
> redirecting to nutch-user...
> 
>> What I currently have is that max. 2 matches are shown per website - but
>> that also from the summary-website only 2 matches are shown. Either I'd
>> need to be able to show only 2 matches per website but _all_ matches
>> from the summary-website (would be okay in this case) or give website 1
>> to 4 individual "IDs per website" and also assign each URL from the
>> summary-website the corresponding ID of the website it belongs to.
> 
> You can add whatever (meta-)data to index with indexing filter. You could
> for example assign tag "A" to site A, tag "B" to B etc...
> then assign unique tags for pages from summary site.
> 
> In searching phase you then use that new field as dedupfield (instead of
> site)
> 
> This should give you max (for example 2) hits per website and unlimited
> hits
> from summary web site.
> 
> Does that fulfill your requirements?

That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
what "filter"?

  Stefan