Posted to user@nutch.apache.org by Stefan Neufeind <ap...@stefan-neufeind.de> on 2006/05/25 13:21:07 UTC
Sorting in nutch-webinterface - how?
Hi,
I am using index-basic and index-more. I see lastModified in the
RSS output. Now I want to use &sort=lastModified, but it does not work.
The same goes for &sort=title. However, &sort=url does work.
What am I doing wrong here?
Regards,
Stefan
Re: Sorting in nutch-webinterface - how?
Posted by Doug Cutting <cu...@apache.org>.
Stefan Neufeind wrote:
> Can you maybe also help me out with sort=title?
Lucene's sorting works with indexed, non-tokenized fields. The title field is
tokenized. If you need to sort by title then you'd need to add a plugin
that indexes another field (e.g., "sortTitle") containing the
un-tokenized title, perhaps lowercased, if you want case-independent
sorting.
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Sort.html
Doug
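A minimal sketch of the un-tokenized, lowercased title value described
above. SortTitleKey is a hypothetical helper, not part of Nutch or Lucene,
and trimming surrounding whitespace is an added assumption. Writing the
result into the index as a "sortTitle" field from an indexing plugin is
left out, since that step depends on the Nutch and Lucene versions in use.

```java
import java.util.Locale;

// Hypothetical helper (not a Nutch or Lucene class).
final class SortTitleKey {
    private SortTitleKey() {}

    // Returns the value intended for an un-tokenized "sortTitle" field:
    // the whole title as one string, trimmed and lowercased so that
    // sorting becomes case-independent. A missing title maps to "".
    static String of(String title) {
        if (title == null) {
            return "";
        }
        return title.trim().toLowerCase(Locale.ROOT);
    }
}
```

With this, "Apache" and "apache" produce the same key, so they sort
together regardless of case.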
Re: Sorting in nutch-webinterface - how?
Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
>
> On 26.05.2006 at 01:57, Stefan Neufeind wrote:
>>> Modified. If not, date=FetchTime.
>>
>> Hi Marko,
>>
>
> Hi Stefan,
>
>> that hint really helped. Can you maybe also help me out with sort=title?
>> See also:
>> http://issues.apache.org/jira/browse/NUTCH-287
>>
>> The problem is that it works on some searches - but not always. Could it
>> be that maybe some plugins don't write a title or write title as
>> null/empty and that leads to problems? What could I do:
>
> If an HTML page begins with "<?xml", then the TextParser is used and not
> the HTML parser (I am not sure). If the TextParser is used to parse this
> page, then no title will be extracted. So in this case the title is empty
> and the summary is XML code.
>
> Please check your pages that have no title and see whether "<?xml"
> appears at the beginning of the page.
I could understand that those documents are "problematic" in sorting -
e.g. they would all be in front or at the end of the sorted list. But
why does this actually lead to no output/an exception/...?
Maybe in case no title is present at least _something_ could be used -
e.g. the URL instead or so?
Regards,
Stefan
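The fallback suggested above (use the URL when no title is present) could
be sketched as follows. TitleFallback is a hypothetical helper, not an
existing Nutch class, and treating a whitespace-only title as missing is an
assumption.

```java
// Hypothetical helper (not a Nutch class).
final class TitleFallback {
    private TitleFallback() {}

    // Returns the title if it has visible content, otherwise the URL,
    // so that every document has something to sort by.
    static String titleOrUrl(String title, String url) {
        if (title == null || title.trim().isEmpty()) {
            return url;
        }
        return title;
    }
}
```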
Re: Sorting in nutch-webinterface - how?
Posted by Marko Bauhardt <mb...@media-style.com>.
On 26.05.2006 at 01:57, Stefan Neufeind wrote:
>> Modified. If not, date=FetchTime.
>
> Hi Marko,
>
Hi Stefan,
> that hint really helped. Can you maybe also help me out with
> sort=title?
> See also:
> http://issues.apache.org/jira/browse/NUTCH-287
>
> The problem is that it works on some searches - but not always. Could it
> be that maybe some plugins don't write a title or write title as
> null/empty and that leads to problems? What could I do:
If an HTML page begins with "<?xml", then the TextParser is used and
not the HTML parser (I am not sure). If the TextParser is used to
parse this page, then no title will be extracted. So in this case the
title is empty and the summary is XML code.
Please check your pages that have no title and see whether "<?xml"
appears at the beginning of the page.
Marko
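One way to run the check described above is sketched here, assuming the
raw page contents are already available as strings keyed by URL. The class
and method names are hypothetical, and how the contents are fetched is not
shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch class).
final class XmlPrologCheck {
    private XmlPrologCheck() {}

    // True if the raw page content begins with "<?xml".
    static boolean startsWithXmlProlog(String content) {
        return content != null && content.startsWith("<?xml");
    }

    // Returns the URLs whose content begins with "<?xml",
    // in the iteration order of the given map.
    static List<String> urlsStartingWithXml(Map<String, String> pages) {
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (startsWithXmlProlog(page.getValue())) {
                hits.add(page.getKey());
            }
        }
        return hits;
    }
}
```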
Re: Sorting in nutch-webinterface - how?
Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
>
>> Hmm, that works. But why, given that I think the field is named lastModified?
>
> lastModified is only used if it is available via the HTML
> meta tags. If that is true, lastModified is stored but not indexed.
> However, the date field is always indexed. If lastModified is available
> as a meta tag, then date=lastModified. If not, date=FetchTime.
Hi Marko,
that hint really helped. Can you maybe also help me out with sort=title?
See also:
http://issues.apache.org/jira/browse/NUTCH-287
The problem is that it works on some searches - but not always. Could it
be that maybe some plugins don't write a title or write title as
null/empty and that leads to problems? What could I do:
a) as a quickfix to prevent the exception and
b) to track this further down which result(s) and why actually cause the
problem.
I've taken a look at the javadoc of the Lucene interface. It looks
as if, when you sort by a field, fields[0] should always be set to
the value of the field you sorted by. As far as I can tell, though, it is
actually null, or maybe fields is even empty.
Regards,
Stefan
Re: Sorting in nutch-webinterface - how?
Posted by Marko Bauhardt <mb...@media-style.com>.
>
> Hmm, that works. But why, given that I think the field is named
> lastModified?
lastModified is only used if it is available via the HTML
meta tags. If that is true, lastModified is stored but not indexed.
However, the date field is always indexed. If lastModified is
available as a meta tag, then date=lastModified. If not, date=FetchTime.
HTH,
Marko
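The rule above for the indexed date field can be written as a small
sketch. IndexDate is a hypothetical helper, not the actual index-more
code. Representing both times as epoch milliseconds and a missing
lastModified as null are assumptions.

```java
// Hypothetical helper (not the actual index-more implementation).
final class IndexDate {
    private IndexDate() {}

    // Value for the always-indexed "date" field:
    // lastModified when the page supplied one, otherwise the fetch time.
    static long choose(Long lastModified, long fetchTime) {
        return lastModified != null ? lastModified.longValue() : fetchTime;
    }
}
```

Because date is filled in either case, sort=date has a value for every
document, while lastModified exists only for some and is not indexed.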
Re: Sorting in nutch-webinterface - how?
Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
>
> On 25.05.2006 at 13:21, Stefan Neufeind wrote:
>
>> Hi,
>>
>> I am using index-basic and index-more. I see lastModified in the
>> RSS output. Now I want to use &sort=lastModified, but it does not work.
>
> Try sort=date.
Hmm, that works. But why, given that I think the field is named lastModified?
Thank you very much for your help,
Stefan
Re: Sorting in nutch-webinterface - how?
Posted by Marko Bauhardt <mb...@media-style.com>.
On 25.05.2006 at 13:21, Stefan Neufeind wrote:
> Hi,
>
> I am using index-basic and index-more. I see lastModified in the
> RSS output. Now I want to use &sort=lastModified, but it does not work.
Try sort=date.
Regards,
Marko