Posted to java-user@lucene.apache.org by Chris Fraschetti <fr...@gmail.com> on 2005/07/28 01:16:07 UTC
URL Stemmer
Writing simple code to trim down a URL is trivial, but actually
trimming it down to its most meaningful state is very hard. In some cases
the URL parameters actually define the page; in others they are useless
babble. I'd like to use a hash of a page's URL as well as a hash of
the content data to help me eliminate duplicates... are there any good
methods that are commonly used for URL stemming?
--
___________________________________________________
Chris Fraschetti
e fraschetti@gmail.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: URL Stemmer
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hm, not sure why you're emailing java-user@lucene. nutch-user@lucene
may be better. Here are 2 ancient classes from 2003 that I once used
to normalize URLs, to help me identify URL duplicates. The attachment
may get stripped on its way to the list.
Otis
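
The classes mentioned above are not preserved in this archive, so what follows is not that code. It is only a minimal sketch of one common way to normalize a URL before hashing it, assuming absolute http/https URLs. The class name UrlNormalizer, its methods, and the IGNORED_PARAMS list are all hypothetical; which parameters count as "useless babble" depends on the site, and sorting the remaining parameters assumes their order carries no meaning.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class UrlNormalizer {

    // Hypothetical noise parameters; the right list depends on the sites crawled.
    private static final Set<String> IGNORED_PARAMS = new HashSet<>(Arrays.asList(
            "sid", "sessionid", "phpsessid", "utm_source", "utm_medium", "utm_campaign"));

    /** Returns a canonical form of an absolute http(s) URL. */
    static String normalize(String url) {
        // URI.normalize() removes "." segments and resolves ".." segments.
        URI uri = URI.create(url.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL: " + url);
        }
        // Scheme and host are case-insensitive, so lowercase both.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(scheme).append("://").append(host);

        // Keep the port only when it is not the scheme's default.
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));
        if (!defaultPort) {
            out.append(':').append(port);
        }

        // An empty path and "/" name the same resource.
        String path = uri.getRawPath();
        out.append(path == null || path.isEmpty() ? "/" : path);

        // Drop ignored parameters and sort the rest so ordering does not matter.
        List<String> kept = new ArrayList<>();
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                String name = pair.split("=", 2)[0].toLowerCase(Locale.ROOT);
                if (!IGNORED_PARAMS.contains(name)) {
                    kept.add(pair);
                }
            }
        }
        Collections.sort(kept);
        if (!kept.isEmpty()) {
            out.append('?').append(String.join("&", kept));
        }
        return out.toString(); // the fragment is intentionally left out
    }

    /** Hex-encoded SHA-256 of the text; usable for a normalized URL or page content. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With this sketch, UrlNormalizer.normalize("HTTP://Example.COM:80/a/./b/../c?b=2&a=1&sid=xyz#top") yields "http://example.com/a/c?a=1&b=2", so both spellings of that URL produce the same sha256Hex value. The same hash method could be applied to the page content to catch duplicates whose URLs genuinely differ.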