Posted to java-user@lucene.apache.org by Chris Fraschetti <fr...@gmail.com> on 2005/07/28 01:16:07 UTC

URL Stemmer

Writing simple code to trim down a URL is trivial, but actually
trimming it down to its most meaningful form is very hard. In some cases
the URL parameters define the page; in others they are useless
babble. I'd like to use a hash of a page's URL as well as a hash of
the content data to help me eliminate duplicates... are there any good
methods that are commonly used for URL stemming?

-- 
___________________________________________________
Chris Fraschetti
e fraschetti@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: URL Stemmer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hm, not sure why you're emailing java-user@lucene.  nutch-user@lucene
may be better.  Here are 2 ancient classes from 2003 that I once used
to normalize URLs, to help me identify URL duplicates.  The attachment
may get stripped on its way to the list.

Otis
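
The attached classes are not reproduced in this archive. As a rough
illustration of the kind of URL normalization discussed here, the
following is a minimal sketch only, and not the code Otis attached. The
class name UrlNormalizer, the IGNORED_PARAMS list, and the choice of
SHA-256 are all assumptions made for the example. It also assumes that
query parameter order does not matter and that the listed session-style
parameters never change the page, which will not hold for every site.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class UrlNormalizer {
    // Hypothetical list of parameters assumed not to affect page content.
    static final Set<String> IGNORED_PARAMS =
            new HashSet<>(Arrays.asList("sessionid", "sid", "phpsessid"));

    /** Returns a canonical form of an absolute URL, suitable for hashing. */
    static String normalize(String url) {
        // URI.normalize() removes "." and resolves ".." path segments.
        URI uri = URI.create(url.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with a host expected: " + url);
        }
        // Scheme and host are case-insensitive, so lowercase them.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Keep only parameters not in the ignore list, then sort them so
        // that "?a=1&b=2" and "?b=2&a=1" normalize to the same string.
        List<String> kept = new ArrayList<>();
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String name = eq >= 0 ? pair.substring(0, eq) : pair;
                if (!IGNORED_PARAMS.contains(name.toLowerCase(Locale.ROOT))) {
                    kept.add(pair);
                }
            }
            Collections.sort(kept);
        }

        // The fragment ("#...") is never sent to the server, so it is dropped.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (!defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (!kept.isEmpty()) {
            sb.append('?').append(String.join("&", kept));
        }
        return sb.toString();
    }

    /** Hex-encoded SHA-256 of a string; usable for a URL or for page content. */
    static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Two URLs would then be treated as URL duplicates when
sha256Hex(normalize(a)) equals sha256Hex(normalize(b)), and the same
sha256Hex helper could be applied to the fetched content for the second
hash. Deciding which parameters are safe to ignore is the hard part the
original question points at, and it generally has to be tuned per site.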


--- Chris Fraschetti <fr...@gmail.com> wrote:

> Writing simple code to trim down a URL is trivial, but actually
> trimming it down to its most meaningful form is very hard. In some cases
> the URL parameters define the page; in others they are useless
> babble. I'd like to use a hash of a page's URL as well as a hash of
> the content data to help me eliminate duplicates... are there any good
> methods that are commonly used for URL stemming?