Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/18 23:13:08 UTC

[jira] Created: (NUTCH-271) Meta-data per URL/site/section

Meta-data per URL/site/section
------------------------------

         Key: NUTCH-271
         URL: http://issues.apache.org/jira/browse/NUTCH-271
     Project: Nutch
        Type: New Feature

    Versions: 0.7.2    
    Reporter: Stefan Neufeind


We need to index sites and attach additional meta-data tags to them. AFAIK this is not yet possible, or is there a "workaround" I don't see? What I have in mind is assigning meta-tags per start URL, indexing only content below that URL, and being able to limit searches by those meta-tags. E.g.

http://www.example1.com/something1/   -> meta-tag "companybranch1"
http://www.example2.com/something2/   -> meta-tag "companybranch2"
http://www.example3.com/something3/   -> meta-tag "companybranch1"
http://www.example4.com/something4/   -> meta-tag "companybranch3"

It should then be possible to search for everything in companybranch1, or across companybranch1 and companybranch3, or similar.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ] 

Gal Nitzan commented on NUTCH-271:
----------------------------------

This functionality is already available in Nutch-0.8



[jira] Closed: (NUTCH-271) Meta-data per URL/site/section

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-271?page=all ]

Andrzej Bialecki  closed NUTCH-271.
-----------------------------------

    Resolution: Fixed

I'm closing this issue because this functionality can be achieved by combining CrawlDatum.metaData with URL and scoring filters.


        

[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ] 

Gal Nitzan commented on NUTCH-271:
----------------------------------

Sorry for the short comment.

Actually, the meta-tag functionality is already available in version 0.8, along with the CrawlDatum object.

You can build the required functionality just by developing plugins for parsing, indexing, and querying.

HTH.



[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412523 ] 

Gal Nitzan commented on NUTCH-271:
----------------------------------


Hi Stefan,

Indeed, 0.8 is not a 1.0 release yet, but it is stable and we are using it in production.

As a whole, Nutch is great and does the job right. There is a lot of tweaking involved, but once you get the whole thing configured to your liking there is not much to change afterwards.

In terms of plugin development, I do not think Java is that far from PHP, so I do not think you would have a hard time there. The plugins are usually pretty small pieces of code, since most of the work is already done by Nutch.

For example, say you want to check a certain rule and, based on that rule, add some information to the index so that you can later search your index by that tag.

The way to go about it would be to develop a parse filter plugin. This plugin is called during the parse phase, which usually happens right after fetching unless disabled in the configuration.
The plugin has one interface method, filter, which gets the URL, the content, and a parse object that contains a metadata object, for every page fetched. There you can put an implementation that adds a metadata tag when the URL of the fetched page matches some criteria.

Then you would add an index plugin that takes that metadata and stores it in your index as a new field.

The last thing to do is write a query plugin that enables you to search the index by the field you added in your indexing phase.
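
To make the three steps above concrete, here is a minimal sketch in plain Java. It deliberately uses no Nutch classes: the class name TagPipelineSketch, the method names, and the field name "branch" are all made up for illustration, and a real plugin would implement the corresponding Nutch extension points instead. It only shows how a tag set at parse time travels into the index and is used to restrict a search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the parse -> index -> query steps; no Nutch classes are used. */
public class TagPipelineSketch {

    /** Step 1 (parse filter): add a metadata tag when the URL matches a rule. */
    public static Map<String, String> parseFilter(String url) {
        Map<String, String> metadata = new HashMap<>();
        // Hypothetical rule: pages below this prefix belong to "companybranch1".
        if (url.startsWith("http://www.example1.com/something1/")) {
            metadata.put("branch", "companybranch1");
        }
        return metadata;
    }

    /** Step 2 (index plugin): copy the metadata tag into the indexed document as a field. */
    public static Map<String, String> indexDocument(String url, Map<String, String> metadata) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        String branch = metadata.get("branch");
        if (branch != null) {
            doc.put("branch", branch);
        }
        return doc;
    }

    /** Step 3 (query plugin): keep only documents whose field has the requested value. */
    public static List<String> searchByBranch(List<Map<String, String>> index, String branch) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if (branch.equals(doc.get("branch"))) {
                hits.add(doc.get("url"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] fetched = {
            "http://www.example1.com/something1/a.html",
            "http://www.example2.com/something2/b.html"
        };
        List<Map<String, String>> index = new ArrayList<>();
        for (String url : fetched) {
            index.add(indexDocument(url, parseFilter(url)));
        }
        // Only the page below the tagged prefix is returned.
        System.out.println(searchByBranch(index, "companybranch1"));
    }
}
```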

HTH.

Gal.

These kinds of questions should be sent to the user list and not to JIRA.



Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Hi Sami,

1) That sounds quite interesting. Is there any basic information on how to
work with that? Might be useful for something else I'm trying :-)

2) The original intent of my question was that I need certain
meta-data by which I can group (reduce? what's the correct word?)
search results. I have:

- Websites 1 to 4 with 50 pages each
- A summary-website which holds one URL for each of the websites with a
short profile etc.

When doing a search I'd like to display only two matches per website, but:
- either show all matches from the summary-website and two matches for
each of websites 1 to 4 that has matches (showing e.g. 2 matches from
website 1; for websites 2 to 4 no pages match, but the profiles for
websites 2 to 4 might match and thus would need to be displayed)

- or display matches grouped by website, including the appropriate pages
from the summary-website as well: in case there are matches from website
2 and the profile for website 2 also matches, the 2 best matches would be
shown for website 2, which could be the profile from the
summary-website as well as one match from website 2. But a profile
for website 3 might still be shown as well, since that counts towards
website 3, although its URL (site-value) is actually part of the
summary-website.


What I currently have is that max. 2 matches are shown per website, but
then only 2 matches are shown from the summary-website as well. Either I'd
need to be able to show only 2 matches per website but _all_ matches
from the summary-website (would be okay in this case), or give websites 1
to 4 individual "IDs per website" and also assign each URL from the
summary-website the corresponding ID of the website it belongs to.

(Note: I know all URLs of the summary-website beforehand, and know which
website/website-ID each URL belongs to.)
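
The second variant (one ID per website, with each summary-website URL carrying the ID of the website it describes) might look roughly like this once the IDs are available on the hits. This is only a sketch under assumptions: the class GroupedResultsSketch, the Hit type, and the host name summary.example.com are hypothetical, no Nutch API is used, and the hits are assumed to arrive already ranked best-first with their website ID attached at index time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups ranked hits by website ID and caps the number shown per ID. */
public class GroupedResultsSketch {

    /** A search hit: its URL and the website ID assigned to it at index time. */
    public static final class Hit {
        public final String url;
        public final String siteId;

        public Hit(String url, String siteId) {
            this.url = url;
            this.siteId = siteId;
        }
    }

    /**
     * Walks the hits in ranking order (best first) and keeps at most
     * maxPerSite URLs per website ID. Groups appear in the order in which
     * their first hit was seen.
     */
    public static Map<String, List<String>> groupTopHits(List<Hit> rankedHits, int maxPerSite) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            List<String> urls = grouped.computeIfAbsent(hit.siteId, id -> new ArrayList<>());
            if (urls.size() < maxPerSite) {
                urls.add(hit.url);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        // Profile pages live on the summary site but carry the ID of the site they describe.
        ranked.add(new Hit("http://summary.example.com/profile2", "website2"));
        ranked.add(new Hit("http://www.example2.com/a.html", "website2"));
        ranked.add(new Hit("http://www.example2.com/b.html", "website2"));
        ranked.add(new Hit("http://summary.example.com/profile3", "website3"));

        // website2 keeps its two best hits; the profile for website3 still shows up.
        System.out.println(groupTopHits(ranked, 2));
    }
}
```

Because the cap is applied per website ID rather than per host, a profile page from the summary-website competes only with pages of the website it belongs to, not with the other profiles.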


Sorry for the long explanation - but I hope I made it clear.
How would that be doable?



Regards,
  Stefan

Sami Siren wrote:
> 0.8 has a subcollection plugin. It can add a subcollection id for a set of urls
> and then you can limit searching to subcollections. Is that what you're
> after?
> 
> -- 
> Sami Siren

Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by Sami Siren <ss...@gmail.com>.
0.8 has a subcollection plugin. It can add a subcollection id for a set of urls
and then you can limit searching to subcollections. Is that what you're
after?

--
 Sami Siren

Stefan Neufeind (JIRA) wrote:

>    [ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12422226 ] 
>            
>Stefan Neufeind commented on NUTCH-271:
>---------------------------------------
>
>Does somebody have an existing demo plugin for that, one that would read URL prefixes from a file and add certain tags when matches are found? I don't yet fully get how to do it "the elegant way" :-)


[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12422226 ] 
            
Stefan Neufeind commented on NUTCH-271:
---------------------------------------

Does somebody have an existing demo plugin for that, one that would read URL prefixes from a file and add certain tags when matches are found? I don't yet fully get how to do it "the elegant way" :-)
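
For illustration only, the prefix-matching part of such a plugin could be sketched as below. This is not an existing Nutch plugin: the class PrefixTagFile, its methods, and the file format (one "url-prefix tag" pair per line, '#' for comments) are assumptions made for this example. The lines could be read from a file with java.nio.file.Files.readAllLines; here they are given inline to keep the sketch self-contained.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps URL prefixes to tags, using a hypothetical "prefix tag" line format. */
public class PrefixTagFile {

    /**
     * Parses lines of the form "<url-prefix> <tag>".
     * Blank lines, lines starting with '#', and lines without a tag are skipped.
     */
    public static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> prefixToTag = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length == 2) {
                prefixToTag.put(parts[0], parts[1]);
            }
        }
        return prefixToTag;
    }

    /** Returns the tag of the longest prefix that the URL starts with, or null if none matches. */
    public static String tagFor(String url, Map<String, String> prefixToTag) {
        String bestPrefix = null;
        for (String prefix : prefixToTag.keySet()) {
            boolean longer = bestPrefix == null || prefix.length() > bestPrefix.length();
            if (url.startsWith(prefix) && longer) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? null : prefixToTag.get(bestPrefix);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# url-prefix                          tag",
            "http://www.example1.com/something1/   companybranch1",
            "http://www.example2.com/something2/   companybranch2",
            "http://www.example3.com/something3/   companybranch1",
            "http://www.example4.com/something4/   companybranch3");
        Map<String, String> prefixToTag = parse(lines);
        System.out.println(tagFor("http://www.example3.com/something3/page.html", prefixToTag));
    }
}
```

Picking the longest matching prefix is a design choice of this sketch, so that a more specific section of a site can override the tag of its parent prefix.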
