Posted to dev@nutch.apache.org by Howie Wang <ho...@hotmail.com> on 2005/06/19 17:08:54 UTC

Ideas for enhancements

Hi,

There is some functionality for Nutch that I've either implemented
or am planning to implement, and I'm curious whether other people are
interested, so that the changes could perhaps get into the main line.

1. A String[] HitDetails.getValues(String field) method that
returns an array of the values. The current method only returns a
single string, and Lucene indexes can have multiple values
per field.

2. In Link.java, put in a field (parentURL) for the URL of the page that
contains the link. Right now it seems we just have the links themselves
and we can't backtrack where they come from. Being able to backtrack
through the links is handy for doing something like categorization. For
example, you see that all the links are coming from a page about poodles,
so you might categorize the linked page as a poodle page. It might also
come in handy for doing something like a Google TrustRank scoring, where
you penalize certain sites if they're a known link farm, or boost them
if they're from some place respected like DMOZ.

3. Get sorting to work on multiple fields. Lucene already supports
sorting on multiple fields, so it shouldn't be difficult to get this
working. Just change the places where it passes down String field so
that it accepts an array. The sort fields could be read from the query
string in order:

   search.jsp?sort=score&reverse=true&sort=date&reverse=false
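
As a rough illustration of reading the sort fields in order, here is a
minimal sketch. SortSpec and SortParams are hypothetical names, not
existing Nutch classes, and the pairing rule (the i-th "sort" value goes
with the i-th "reverse" value, with a missing "reverse" treated as false)
is an assumption about how the parameters would be interpreted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one sort criterion; not part of Nutch.
final class SortSpec {
    final String field;
    final boolean reverse;

    SortSpec(String field, boolean reverse) {
        this.field = field;
        this.reverse = reverse;
    }
}

// Hypothetical helper that turns repeated query parameters into sort criteria.
final class SortParams {
    // Pairs the i-th "sort" value with the i-th "reverse" value.
    // A missing or unparsable "reverse" value is treated as false.
    static List<SortSpec> parse(String[] sortValues, String[] reverseValues) {
        List<SortSpec> specs = new ArrayList<>();
        if (sortValues == null) {
            return specs; // no sort parameters were given
        }
        for (int i = 0; i < sortValues.length; i++) {
            boolean reverse = reverseValues != null
                    && i < reverseValues.length
                    && Boolean.parseBoolean(reverseValues[i]);
            specs.add(new SortSpec(sortValues[i], reverse));
        }
        return specs;
    }
}
```

In search.jsp the two arrays would presumably come from
request.getParameterValues("sort") and
request.getParameterValues("reverse"), and each SortSpec could then be
mapped to a Lucene SortField when building the Sort.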

Is anybody interested in these things? It would be nice to get them
merged into the main code.

Howie



Re: Ideas for enhancements

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Howie,
> Howie Wang wrote:
>> 1. A String[] HitDetails.getValues(String field) method that
>> returns an array of the values. The current method only returns a
>> single string, and Lucene indexes can have multiple values
>> per field.
>
> That sounds useful.  Please submit a patch against the trunk  
> attached to a bug report.

Any work already done for this? I would love to have multiple values  
and if there is nothing done yet I would love to create such a patch.

Thanks.
Stefan



Re: Ideas for enhancements

Posted by Doug Cutting <cu...@nutch.org>.
Howie Wang wrote:
> 1. A String[] HitDetails.getValues(String field) method that
> returns an array of the values. The current method only returns a
> single string, and Lucene indexes can have multiple values
> per field.

That sounds useful.  Please submit a patch against the trunk attached to 
a bug report.
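
For anyone picking this up, the idea can be sketched as follows.
DetailsSketch is a hypothetical stand-in, not the actual HitDetails
source, and it assumes the details are held as parallel arrays of field
names and values. Returning an empty array when the field is absent is
also an assumption; a real patch might prefer null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HitDetails, assuming parallel field/value arrays.
final class DetailsSketch {
    private final String[] fields;
    private final String[] values;

    DetailsSketch(String[] fields, String[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        this.fields = fields;
        this.values = values;
    }

    // Single-value lookup: returns the first value stored under the field, or null.
    String getValue(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                return values[i];
            }
        }
        return null;
    }

    // Proposed multi-value lookup: returns every value stored under the field,
    // in stored order, or an empty array when the field is absent.
    String[] getValues(String field) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                matches.add(values[i]);
            }
        }
        return matches.toArray(new String[0]);
    }
}
```

Keeping getValue alongside getValues would leave existing callers
untouched while letting new code see every value of a multi-valued field.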

> 2. In Link.java, put in a field (parentURL) for the URL of the page that
> contains the link. Right now it seems we just have the links themselves
> and we can't backtrack where they come from. Being able to backtrack
> through the links is handy for doing something like categorization. For
> example, you see that all the links are coming from a page about poodles,
> so you might categorize the linked page as a poodle page. It might also
> come in handy for doing something like a Google TrustRank scoring, where
> you penalize certain sites if they're a known link farm, or boost them
> if they're from some place respected like DMOZ.

This would certainly be useful functionality.  The link db has changed 
substantially in the current trunk and there is no longer a class named 
Link.  This has been replaced with Inlink and Outlink.  Have a look at 
the trunk and see if what you need isn't already there.
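
To make the categorization idea concrete, here is a minimal sketch of
voting on a page's category from the pages that link to it.
InlinkCategorizer is a hypothetical name, and the inputs are plain
strings rather than the trunk's Inlink objects, since how the source URL
is exposed there is not shown in this thread. The map from page URL to
category is assumed to come from some earlier classification step.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Nutch.
final class InlinkCategorizer {
    // Returns the category that occurs most often among the linking pages,
    // or null when none of them has a known category. On a tie, the category
    // that reaches the top count first wins.
    static String categorize(List<String> sourceUrls, Map<String, String> categoryOfPage) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String url : sourceUrls) {
            String category = categoryOfPage.get(url);
            if (category == null) {
                continue; // this linking page has no known category
            }
            int count = votes.merge(category, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = category;
            }
        }
        return best;
    }
}
```

A trust-style score could follow the same shape, with the map holding a
penalty or boost per source site instead of a category.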

> 3. Get sorting to work on multiple fields. Lucene already supports
> sorting on multiple fields, so it shouldn't be difficult to get this
> working. Just change the places where it passes down String field so that it
> accepts an array. The sort fields could be read from the query
> string in order:
> 
>   search.jsp?sort=score&reverse=true&sort=date&reverse=false

This would also be useful.  Please submit a patch against the trunk.

Thanks!

Doug