Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/07/19 20:54:15 UTC
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12422226 ]
Stefan Neufeind commented on NUTCH-271:
---------------------------------------
Does somebody have an existing demo plugin for this, one that reads URL prefixes from a file and adds certain tags whenever a URL matches? I don't yet fully see how to do it "the elegant way" :-)
> Meta-data per URL/site/section
> ------------------------------
>
> Key: NUTCH-271
> URL: http://issues.apache.org/jira/browse/NUTCH-271
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.7.2
> Reporter: Stefan Neufeind
>
> We need to index sites and attach additional meta-data tags to them. Afaik this is not yet possible, or is there a "workaround" I don't see? What I have in mind is assigning meta-tags per start-URL, indexing only content below that URL, and being able to restrict searches by those meta-tags. E.g.
> http://www.example1.com/something1/ -> meta-tag "companybranch1"
> http://www.example2.com/something2/ -> meta-tag "companybranch2"
> http://www.example3.com/something3/ -> meta-tag "companybranch1"
> http://www.example4.com/something4/ -> meta-tag "companybranch3"
> search for everything in companybranch1 or across 1 and 3 or similar
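A minimal sketch of the prefix matching asked about above. This is not an actual Nutch plugin; it only illustrates the lookup. It assumes a hypothetical text file with one "URL-prefix tag" pair per line, and the names parse_prefix_file, tags_for_url and filter_by_tags are made up for illustration. In Nutch itself this logic would presumably live in an indexing plugin.

```python
def parse_prefix_file(lines):
    """Parse lines of the assumed form '<url-prefix> <tag>' into pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        prefix, tag = line.split(None, 1)
        pairs.append((prefix, tag.strip()))
    return pairs


def tags_for_url(url, prefix_tags):
    """Return the sorted, de-duplicated tags whose prefix the URL starts with."""
    return sorted({tag for prefix, tag in prefix_tags if url.startswith(prefix)})


def filter_by_tags(urls, wanted_tags, prefix_tags):
    """Keep only the URLs that carry at least one of the wanted tags."""
    wanted = set(wanted_tags)
    return [u for u in urls if wanted & set(tags_for_url(u, prefix_tags))]


# Mapping taken from the example in the issue description
# (the one-pair-per-line file format itself is an assumption).
EXAMPLE_LINES = [
    "http://www.example1.com/something1/ companybranch1",
    "http://www.example2.com/something2/ companybranch2",
    "http://www.example3.com/something3/ companybranch1",
    "http://www.example4.com/something4/ companybranch3",
]
PREFIX_TAGS = parse_prefix_file(EXAMPLE_LINES)
```

Calling tags_for_url on a page below http://www.example3.com/something3/ yields the tag "companybranch1", and filter_by_tags with ["companybranch1", "companybranch3"] corresponds to the "across 1 and 3" search from the description.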
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section
Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Hi Sami,
1) That sounds quite interesting. Is there any basic information on how
to work with that? Might be useful for something else I'm trying :-)
2) The original intent of my question was that I need certain
meta-data by which I can group (reduce? what's the correct word?)
search results. I have:
- Websites 1 to 4 with 50 pages each
- A summary-website which holds one URL for each of the websites, with a
short profile etc.
When doing a search I'd like to display only two matches per website, but:
- either show all matches from the summary-website plus two matches for
each of websites 1 to 4 that has matches (e.g. 2 matches from website 1;
for websites 2 to 4 no pages match, but the profiles for websites 2 to 4
might match and thus would need to be displayed)
- or display matches grouped by website, including the appropriate pages
from the summary-website as well: in case there are matches from website
2 and the profile for website 2 matches too, the 2 best matches would be
shown for website 2, which could be the profile from the
summary-website as well as one match from website 2. A profile for
website 3 might still be shown as well, since it counts towards website
3, although its URL (site-value) is actually part of the summary-website.
What I currently have is that max. 2 matches are shown per website, but
that limit also applies to the summary-website, so only 2 of its matches
are shown. Either I'd need to be able to show only 2 matches per website
but _all_ matches from the summary-website (would be okay in this case),
or give websites 1 to 4 individual "IDs per website" and also assign each
URL from the summary-website the corresponding ID of the website it
belongs to.
(Note: I know all URLs of the summary-website beforehand, and know which
website/website-ID each URL belongs to.)
Sorry for the long explanation - but I hope I made it clear.
How would that be doable?
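The second option (one ID per website, with each summary-website URL assigned to the website it describes) could be sketched roughly as follows. This is only an illustration of the grouping logic, not how Nutch does it: the summary.example.com URLs, the SUMMARY_OVERRIDES table, and the function names are all hypothetical, and the hostname is assumed to be a good enough default ID.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical table: each known summary-website URL mapped to the ID
# of the website whose profile it holds.
SUMMARY_OVERRIDES = {
    "http://summary.example.com/profile2.html": "www.example2.com",
    "http://summary.example.com/profile3.html": "www.example3.com",
}


def site_id(url):
    """Use the override for summary pages, else fall back to the hostname."""
    return SUMMARY_OVERRIDES.get(url, urlparse(url).netloc)


def group_hits(hits, max_per_site=2):
    """Group (url, score) hits by site ID and keep the best few per site."""
    grouped = defaultdict(list)
    for url, score in hits:
        grouped[site_id(url)].append((url, score))
    return {
        sid: sorted(entries, key=lambda e: e[1], reverse=True)[:max_per_site]
        for sid, entries in grouped.items()
    }
```

With this, a profile page from the summary-website competes for the two slots of the website it belongs to, and a website with no matching pages of its own can still appear through its profile.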
Regards,
Stefan
Sami Siren wrote:
> 0.8 has a subcollection plugin. It can add a subcollection ID to a set
> of URLs, and then you can limit searching to subcollections. Is that
> what you're after?
>
> --
> Sami Siren
Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section
Posted by Sami Siren <ss...@gmail.com>.
0.8 has a subcollection plugin. It can add a subcollection ID to a set
of URLs, and then you can limit searching to subcollections. Is that
what you're after?
--
Sami Siren