Posted to user@nutch.apache.org by Stefan Neufeind <ap...@stefan-neufeind.de> on 2006/05/25 13:21:07 UTC

Sorting in nutch-webinterface - how?

Hi,

I did use index-basic and index-more. I see lastModified in the
RSS-output. Now I want to &sort=lastModified - does not work. Same for
&sort=title. However &sort=url does work.

What am I doing wrong here?


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Posted by Doug Cutting <cu...@apache.org>.
Stefan Neufeind wrote:
> Can you maybe also help me out with sort=title?

Lucene's sorting works with indexed, non-tokenized fields.  The title field is 
tokenized.  If you need to sort by title then you'd need to add a plugin 
that indexes another field (e.g., "sortTitle") containing the 
un-tokenized title, perhaps lowercased, if you want case-independent 
sorting.
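
A minimal sketch of the normalisation such a plugin might apply to the
title before indexing it into the extra field. The names SortTitleKey and
of are hypothetical and are not part of Nutch or Lucene. The sketch only
derives the string value; the plugin would still have to add that value
as an indexed, un-tokenized field.

```java
import java.util.Locale;

/** Hypothetical helper: derives the value for an un-tokenized "sortTitle" field. */
final class SortTitleKey {
    private SortTitleKey() {}

    /**
     * Returns the title trimmed and lowercased so that sorting on the
     * resulting value is case-independent. A null title yields "".
     */
    static String of(String title) {
        if (title == null) {
            return "";
        }
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        return title.trim().toLowerCase(Locale.ROOT);
    }
}
```

Mapping a missing title to an empty string is an assumption made here so
that every document gets some value in the sort field.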

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Sort.html

Doug

Re: Sorting in nutch-webinterface - how?

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
> 
> Am 26.05.2006 um 01:57 schrieb Stefan Neufeind:
>>> Modified. If not, date=FetchTime.
>>
>> Hi Marko,
>>
> 
> Hi Stefan,
> 
>> that hint really helped. Can you maybe also help me out with sort=title?
>> See also:
>> http://issues.apache.org/jira/browse/NUTCH-287
>>
>> The problem is that it works on some searches - but not always. Could it
>> be that maybe some plugins don't write a title or write title as
>> null/empty and that leads to problems? What could I do:
> 
> If an HTML page begins with "<?xml", then the text parser is used and not
> the HTML parser (I am not sure about this). If the TextParser is used to
> parse the page, then no title will be extracted. So in this case the
> title is empty and the summary is XML code.
> 
> Please check the pages that have no title and see whether "<?xml"
> appears at the beginning of the page.

I could understand that those documents are "problematic" in sorting -
e.g. they would all be in front or at the end of the sorted list. But
why does this actually lead to no output/an exception/...?

Maybe in case no title is present at least _something_ could be used -
e.g. the URL instead or so?
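
The fallback idea can be sketched in plain Java, outside of Nutch. The
Hit class below is a hypothetical stand-in for a search result, not a
Nutch type, and the sketch says nothing about how Nutch or Lucene
actually treat documents without a title. It only shows that a sort key
can always be produced when the URL is used in place of a missing title.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TitleSortFallback {
    private TitleSortFallback() {}

    /** Hypothetical stand-in for a search hit; not a Nutch class. */
    static final class Hit {
        final String url;
        final String title; // may be null or empty

        Hit(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** Sort key: the trimmed title if one is present, otherwise the URL. */
    static String sortKey(Hit hit) {
        boolean hasTitle = hit.title != null && !hit.title.trim().isEmpty();
        return hasTitle ? hit.title.trim() : hit.url;
    }

    /** Returns a new list sorted case-insensitively by sortKey. */
    static List<Hit> sortByTitle(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing(TitleSortFallback::sortKey,
                                         String.CASE_INSENSITIVE_ORDER));
        return sorted;
    }
}
```

With this key, untitled pages are ordered by URL among the titled ones
instead of being left without a sort value.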


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Posted by Marko Bauhardt <mb...@media-style.com>.
Am 26.05.2006 um 01:57 schrieb Stefan Neufeind:
>> Modified. If not, date=FetchTime.
>
> Hi Marko,
>

Hi Stefan,

> that hint really helped. Can you maybe also help me out with  
> sort=title?
> See also:
> http://issues.apache.org/jira/browse/NUTCH-287
>
> The problem is that it works on some searches - but not always.  
> Could it
> be that maybe some plugins don't write a title or write title as
> null/empty and that leads to problems? What could I do:

If an HTML page begins with "<?xml", then the text parser is used and
not the HTML parser (I am not sure about this). If the TextParser is used
to parse the page, then no title will be extracted. So in this case the
title is empty and the summary is XML code.

Please check the pages that have no title and see whether "<?xml"
appears at the beginning of the page.
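
A small helper for that check, as a sketch. The class and method names
are hypothetical. Skipping a byte order mark and leading whitespace is an
assumption made here to be lenient; whether the parser selection in Nutch
tolerates them is not something this sketch claims to know.

```java
final class XmlPrologCheck {
    private XmlPrologCheck() {}

    /**
     * Returns true if the page content starts with "<?xml", ignoring an
     * optional byte order mark and any leading whitespace.
     */
    static boolean startsWithXmlDeclaration(String content) {
        if (content == null) {
            return false;
        }
        int i = 0;
        // Skip an optional byte order mark.
        if (!content.isEmpty() && content.charAt(0) == '\uFEFF') {
            i = 1;
        }
        // Skip leading whitespace.
        while (i < content.length() && Character.isWhitespace(content.charAt(i))) {
            i++;
        }
        return content.startsWith("<?xml", i);
    }
}
```

Running this over the raw content of the pages that come back without a
title would show whether they share the "<?xml" prefix.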

Marko

Re: Sorting in nutch-webinterface - how?

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
> 
>> Hmm, that works. But why - since I think the field is named lastModified.
> 
> lastModified is only used if it is available from the HTML meta tags.
> If it is, lastModified is stored but not indexed. The date field,
> however, is always indexed. If lastModified is available as a meta tag,
> then date=lastModified. If not, date=FetchTime.

Hi Marko,

that hint really helped. Can you maybe also help me out with sort=title?
See also:
http://issues.apache.org/jira/browse/NUTCH-287

The problem is that it works on some searches - but not always. Could it
be that maybe some plugins don't write a title or write title as
null/empty and that leads to problems? What could I do:
a) as a quickfix to prevent the exception    and
b) to track this further down which result(s) and why actually cause the
problem.

I've taken a look at the javadoc from the lucene-interface. It looks
like if you sort by something the fields[0] should always be set with
the field you searched for - but afaik actually it is null, or maybe
even fields is empty or so.


Regards,
 Stefan

Re: Sorting in nutch-webinterface - how?

Posted by Marko Bauhardt <mb...@media-style.com>.

>
> Hmm, that works. But why - since I think the field is named  
> lastModified.

lastModified is only used if it is available from the HTML meta tags.
If it is, lastModified is stored but not indexed. The date field,
however, is always indexed. If lastModified is available as a meta tag,
then date=lastModified. If not, date=FetchTime.
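
The rule for the date field can be written down as a one-line choice.
This is a sketch of the rule as described above, not the code of the
index-more plugin; the class and method names are hypothetical, and
representing both timestamps as epoch milliseconds is an assumption.

```java
final class DateFieldChoice {
    private DateFieldChoice() {}

    /**
     * Value for the always-indexed "date" field.
     *
     * @param lastModified last-modified time from the page metadata in
     *                     epoch milliseconds, or null if the page has none
     * @param fetchTime    time the page was fetched, in epoch milliseconds
     */
    static long indexedDate(Long lastModified, long fetchTime) {
        return lastModified != null ? lastModified : fetchTime;
    }
}
```

This is why sort=date works for every document while sort=lastModified
does not: date always has an indexed value, whereas lastModified is only
stored.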

HTH,
Marko


Re: Sorting in nutch-webinterface - how?

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
Marko Bauhardt wrote:
> 
> Am 25.05.2006 um 13:21 schrieb Stefan Neufeind:
> 
>> Hi,
>>
>> I did use index-basic and index-more. I see lastModified in the
>> RSS-output. Now I want to &sort=lastModified - does not work.
> 
> Try sort=date.

Hmm, that works. But why - since I think the field is named lastModified.


Thank you very much for your help,
 Stefan

Re: Sorting in nutch-webinterface - how?

Posted by Marko Bauhardt <mb...@media-style.com>.
Am 25.05.2006 um 13:21 schrieb Stefan Neufeind:

> Hi,
>
> I did use index-basic and index-more. I see lastModified in the
> RSS-output. Now I want to &sort=lastModified - does not work.

Try sort=date.

Regards,
Marko