Posted to dev@nutch.apache.org by Alan Wang <sf...@gmail.com> on 2005/04/20 04:01:56 UTC

Sort does not work properly

Hi,

I am trying to sort the search results by the "lastModified" field. So I index
"lastModified" as an Integer and a Keyword in the index and search with the
search(Query query, Filter filter, int n, Sort sort) method. I only modified
net.nutch.searcher.LuceneQueryOptimizer.optimize:
return searcher.search(query, filter, numHits,
	new Sort(
		new SortField[]{
			new SortField("lastModified", SortField.INT, true)
		}
	));

The results did change and are now largely sorted by time, but they are not
exactly sorted by lastModified. The results look ugly, :(.

Can somebody help?

--
Regards,
Alan Wang 


Re: [Nutch-dev] Re: Sort does not work properly

Posted by zhang jin <pr...@gmail.com>.
That's good, thanks.

2005/4/21, Alan Wang <sf...@gmail.com>: 
> 
> Thanks.
> 
> I am sorry; I thought the message had not been sent, so I resent it. :(
> And I am sorry that I did not describe the problem clearly.
> 
> The two items that Doug mentioned are not the source of this problem,
> because I have already changed MoreIndexingFilter.java as listed
> below. So maybe there is something odd in the Sort-related code. I
> will dig deeper and check SortComparatorSource and
> HitCollector for more information.
> 
> BTW,
> 1. fo.getFetchDate() is more reasonable than using the current time, and I
> will change it.
> 2. When any document lacked the "lastModified" field, the sort
> results were totally wrong. Doug, maybe you know why this happens.
> Now they are only partly wrong.
> :)
> 
> code listed below:
> ------
>   private Document addTime(Document doc, Properties metaData, String url) {
> 
>     String lastModified = metaData.getProperty("last-modified");
>     if (lastModified == null)
>       return doc;
> 
>     // index/store it as a long value (milliseconds since the epoch)
>     DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy zzz");
>     try {
>       lastModified = Long.toString(HttpDateFormat.toLong(lastModified));
>     } catch (ParseException e) {
>       // try to parse it as a date in the alternative format
>       try {
>         Date d = df.parse(lastModified);
>         lastModified = Long.toString(d.getTime());
>       } catch (Exception e1) {
>         // log the bad header value before it is overwritten
>         LOG.fine(url + ": can't parse erroneous last-modified: " + lastModified);
>         // fall back to the current time so the field is never missing
>         lastModified = Long.toString(System.currentTimeMillis());
>       }
>     }
> 
>     if (lastModified != null)
>       doc.add(Field.Keyword("lastModified", lastModified));
> 
>     return doc;
>   }
> 
> On 4/21/05, Doug Cutting <cu...@nutch.org> wrote:
> > Alan Wang wrote:
> > > I am trying to sort the search results by the "lastModified" field. So I index
> > > "lastModified" as an Integer and a Keyword in the index and search with the
> > > search(Query query, Filter filter, int n, Sort sort) method. I only modified
> > > net.nutch.searcher.LuceneQueryOptimizer.optimize:
> > > return searcher.search(query, filter, numHits,
> > >       new Sort(
> > >               new SortField[]{
> > >                       new SortField("lastModified", SortField.INT, true)
> > >               }
> > >       ));
> > >
> > > The results did change and are now largely sorted by time, but they are not
> > > exactly sorted by lastModified. The results look ugly, :(.
> >
> > I can see two sources of problems:
> >
> > 1. You should sort by the "date" field, not "lastModified", since that's
> > not indexed, and sorting requires an indexed field.
> >
> > 2. Not all pages have a lastModified value. You should change
> > MoreIndexingFilter to always add a date. If no last modified is
> > specified, then use the fetch date, fo.getFetchDate().
> >
> > If you get this working, please send a patch. Even if it's a hack, it's
> > a start for others.
> >
> > Thanks,
> >
> > Doug
> >
> > _______________________________________________
> > Nutch-developers mailing list
> > Nutch-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> >
> 
> --
> Regards,
> Alan Wang
> 



-- 
TEL 0512-68251233-6966
MSN:prettysino@hotmail.com
Mail:jimijinzhang@BenQ.com
QQ:58624951
BenQ.com
268 Shishan Road, New District, 
Suzhou, China

Re: [Nutch-dev] Re: Sort does not work properly

Posted by Alan Wang <sf...@gmail.com>.
Yep, that was my problem. Got it, and thanks.

I also found that my problem is mainly in
net.nutch.searcher.DistributedSearch. Lucene returns properly sorted
results, and the search server returns results sorted by lastModified.
However, the search client re-sorts them by score. The Hit class cannot
pass lastModified back, and fetching the HitDetail for every hit is
inefficient. For now I use the score to carry the value when sorting by
lastModified.

Are there more elegant solutions? And is there any plan to add more
sort methods to the search page? Maybe I can help.

------
Regards,
Alan Wang
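
One possible alternative to overloading the score, offered only as a sketch: let each hit carry the value it was sorted on, so the search client can merge the per-server lists on that value instead of re-sorting by score. The names below (SortedHitMerge, SortedHit, mergeTopN) are hypothetical and are not part of Nutch's Hit or DistributedSearch classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortedHitMerge {

    /** Hypothetical hit that keeps its sort value next to the score. */
    static final class SortedHit {
        final int serverIndex;  // which search server returned the hit
        final int docId;        // document number on that server
        final float score;      // relevance score, used only to break ties
        final int sortValue;    // e.g. lastModified encoded as an int

        SortedHit(int serverIndex, int docId, float score, int sortValue) {
            this.serverIndex = serverIndex;
            this.docId = docId;
            this.score = score;
            this.sortValue = sortValue;
        }
    }

    /** Merges per-server results and keeps the n hits with the largest sort values. */
    static List<SortedHit> mergeTopN(List<List<SortedHit>> perServer, int n) {
        List<SortedHit> all = new ArrayList<SortedHit>();
        for (List<SortedHit> hits : perServer) {
            all.addAll(hits);
        }
        Collections.sort(all, new Comparator<SortedHit>() {
            public int compare(SortedHit a, SortedHit b) {
                if (a.sortValue != b.sortValue) {
                    return a.sortValue > b.sortValue ? -1 : 1;  // newest first
                }
                return Float.compare(b.score, a.score);         // then best score
            }
        });
        return new ArrayList<SortedHit>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<SortedHit> server0 = Arrays.asList(
            new SortedHit(0, 1, 0.9f, 20050101), new SortedHit(0, 2, 0.1f, 20040101));
        List<SortedHit> server1 = Arrays.asList(
            new SortedHit(1, 7, 0.2f, 20050420), new SortedHit(1, 8, 0.8f, 20030101));
        for (SortedHit h : mergeTopN(Arrays.asList(server0, server1), 3)) {
            System.out.println(h.serverIndex + ":" + h.docId + " " + h.sortValue);
        }
    }
}
```

The score stays untouched here and only breaks ties, so a client that still wants relevance ordering does not lose it.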

Re: [Nutch-dev] Re: Sort does not work properly

Posted by Doug Cutting <cu...@nutch.org>.
Alan Wang wrote:
>     String lastModified = metaData.getProperty("last-modified");
>     if (lastModified == null)
>       return doc;

If the metaData does not contain a "last-modified" entry (from the http 
headers) then the document ends up with no last-modified field, and 
hence nothing to sort it on.

Also, the sorting code you sent assumes that dates are ints, while 
you've modified things to index a long.  That will cause problems too. 
It is substantially more efficient in Lucene to sort by ints, so I 
recommend switching this back to indexing a YYYYMMDD int.  If you need 
more precision, you could index to the hour (YYYYMMDDHH) and still stay 
within positive integers, or you could convert things to something like 
minutes since 1970.

Doug
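
A minimal sketch of the three int encodings suggested above, assuming the timestamp is available as milliseconds since the epoch and that GMT is an acceptable time zone for the keys. The class and method names (DateKeys, toDayKey, toHourKey, toMinutesSinceEpoch) are made up for illustration and are not part of Nutch or Lucene.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateKeys {

    // Formats the timestamp in GMT with the given numeric pattern and parses it as an int.
    private static int format(long epochMillis, String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return Integer.parseInt(f.format(new Date(epochMillis)));
    }

    /** Day precision: YYYYMMDD. */
    static int toDayKey(long epochMillis) {
        return format(epochMillis, "yyyyMMdd");
    }

    /** Hour precision: YYYYMMDDHH, which fits in a positive int through the year 2147. */
    static int toHourKey(long epochMillis) {
        return format(epochMillis, "yyyyMMddHH");
    }

    /** Minute precision: whole minutes since 1970-01-01 00:00 GMT. */
    static int toMinutesSinceEpoch(long epochMillis) {
        long minutes = epochMillis / 60000L;
        if (minutes > Integer.MAX_VALUE || minutes < Integer.MIN_VALUE) {
            throw new IllegalArgumentException("timestamp out of int range: " + epochMillis);
        }
        return (int) minutes;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(toDayKey(now));
        System.out.println(toHourKey(now));
        System.out.println(toMinutesSinceEpoch(now));
    }
}
```

Whichever key is chosen, the indexed value and the SortField type have to agree: a value indexed as a long string will not sort correctly as SortField.INT.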

Re: [Nutch-dev] Re: Sort does not work properly

Posted by Alan Wang <sf...@gmail.com>.
Thanks.

I am sorry; I thought the message had not been sent, so I resent it. :(
And I am sorry that I did not describe the problem clearly.

The two items that Doug mentioned are not the source of this problem,
because I have already changed MoreIndexingFilter.java as listed
below. So maybe there is something odd in the Sort-related code. I
will dig deeper and check SortComparatorSource and
HitCollector for more information.

BTW,
1. fo.getFetchDate() is more reasonable than using the current time, and I
will change it.
2. When any document lacked the "lastModified" field, the sort
results were totally wrong. Doug, maybe you know why this happens.
Now they are only partly wrong.
:)

code listed below:
------
  private Document addTime(Document doc, Properties metaData, String url) {

    String lastModified = metaData.getProperty("last-modified");
    if (lastModified == null)
      return doc;

    // index/store it as a long value (milliseconds since the epoch)
    DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy zzz");
    try {
      lastModified = Long.toString(HttpDateFormat.toLong(lastModified));
    } catch (ParseException e) {
      // try to parse it as a date in the alternative format
      try {
        Date d = df.parse(lastModified);
        lastModified = Long.toString(d.getTime());
      } catch (Exception e1) {
        // log the bad header value before it is overwritten
        LOG.fine(url + ": can't parse erroneous last-modified: " + lastModified);
        // fall back to the current time so the field is never missing
        lastModified = Long.toString(System.currentTimeMillis());
      }
    }

    if (lastModified != null)
      doc.add(Field.Keyword("lastModified", lastModified));

    return doc;
  }

On 4/21/05, Doug Cutting <cu...@nutch.org> wrote:
> Alan Wang wrote:
> > I am trying to sort the search results by the "lastModified" field. So I index
> > "lastModified" as an Integer and a Keyword in the index and search with the
> > search(Query query, Filter filter, int n, Sort sort) method. I only modified
> > net.nutch.searcher.LuceneQueryOptimizer.optimize:
> > return searcher.search(query, filter, numHits,
> >       new Sort(
> >               new SortField[]{
> >                       new SortField("lastModified", SortField.INT, true)
> >               }
> >       ));
> >
> > The results did change and are now largely sorted by time, but they are not
> > exactly sorted by lastModified. The results look ugly, :(.
> 
> I can see two sources of problems:
> 
> 1. You should sort by the "date" field, not "lastModified", since that's
> not indexed, and sorting requires an indexed field.
> 
> 2. Not all pages have a lastModified value.  You should change
> MoreIndexingFilter to always add a date.  If no last modified is
> specified, then use the fetch date, fo.getFetchDate().
> 
> If you get this working, please send a patch.  Even if it's a hack, it's
> a start for others.
> 
> Thanks,
> 
> Doug
> 


-- 
Regards,
Alan Wang

Re: Sort does not work properly

Posted by Doug Cutting <cu...@nutch.org>.
Alan Wang wrote:
> I am trying to sort the search results by the "lastModified" field. So I index
> "lastModified" as an Integer and a Keyword in the index and search with the
> search(Query query, Filter filter, int n, Sort sort) method. I only modified
> net.nutch.searcher.LuceneQueryOptimizer.optimize:
> return searcher.search(query, filter, numHits,
> 	new Sort(
> 		new SortField[]{
> 			new SortField("lastModified", SortField.INT, true)
> 		}
> 	));
> 
> The results did change and are now largely sorted by time, but they are not
> exactly sorted by lastModified. The results look ugly, :(.

I can see two sources of problems:

1. You should sort by the "date" field, not "lastModified", since that's 
not indexed, and sorting requires an indexed field.

2. Not all pages have a lastModified value.  You should change 
MoreIndexingFilter to always add a date.  If no last modified is 
specified, then use the fetch date, fo.getFetchDate().

If you get this working, please send a patch.  Even if it's a hack, it's 
a start for others.

Thanks,

Doug
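
The second point above (always add a date, falling back to the fetch date) could look roughly like the sketch below. It is an illustration under assumptions, not the MoreIndexingFilter code: LastModifiedFallback and resolveDate are hypothetical names, the fetch time is passed in as plain milliseconds rather than read from fo.getFetchDate(), and a standard-library SimpleDateFormat with the RFC 1123 pattern stands in for Nutch's HttpDateFormat.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class LastModifiedFallback {

    /**
     * Returns the Last-Modified header as epoch milliseconds, or the fetch
     * time when the header is missing or cannot be parsed. Every document
     * therefore ends up with a value to sort on.
     */
    static long resolveDate(String lastModifiedHeader, long fetchTimeMillis) {
        if (lastModifiedHeader == null) {
            return fetchTimeMillis;  // no header: use the fetch date
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat rfc1123 =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        rfc1123.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return rfc1123.parse(lastModifiedHeader).getTime();
        } catch (ParseException e) {
            return fetchTimeMillis;  // unparseable header: use the fetch date
        }
    }

    public static void main(String[] args) {
        long fetched = System.currentTimeMillis();
        System.out.println(resolveDate("Wed, 20 Apr 2005 04:01:56 GMT", fetched));
        System.out.println(resolveDate(null, fetched));
    }
}
```

Because the method never returns "no value", the caller can add the sort field unconditionally instead of returning early when the header is absent.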