Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/09/05 07:18:09 UTC
How to make the url id case insensitive?
Hi,
I've just noticed that two search results of indexed data have the same url:
http://www.atory.com/dupe_checker_pro/
http://www.atory.com/dupe_checker_PRO/
I thought the url/id was case-insensitively unique. Is there a way I can set
it up to be so?
For Solr it makes sense that this is not the default, given its disparate
uses, but for Nutch it does not.
--
Regards,
K. Gabriele
--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).
If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
Re: How to make the url id case insensitive?
Posted by Markus Jelsma <ma...@openindex.io>.
Deduplication, either using Nutch or Solr.
> On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> >
> > URI paths are case-sensitive. If you really want to treat all URLs as
> > case-insensitive, I would suggest modifying the basic URL normalizer to
> > lowercase all URLs so that they also end up lowercased in the CrawlDB.
> >
> > What is your problem? I would strongly suggest another solution if you're
> > doing wide web crawls.
>
> I don't want duplicate results where the only real difference is the case
> of some letters in the URL.
> What other solution?
>
> > Cheers,
> >
> > > Hi,
> > > I've just noticed that two search results of indexed data have the same
> > > url:
> > >
> > > http://www.atory.com/dupe_checker_pro/
> > > http://www.atory.com/dupe_checker_PRO/
> > >
> > > I thought the url/id was case-insensitively unique. Is there a way I can
> > > set it up to be so?
> > >
> > > For Solr it makes sense that this is not the default, given its
> > > disparate uses, but for Nutch it does not.
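
One way to read the deduplication suggestion is to collapse pages whose content is identical, regardless of how their URLs are cased. The following is a minimal sketch in plain Java, assuming an MD5 digest of the page text serves as the signature. The class `ContentDedupSketch`, its methods, and the sample page text are hypothetical and are not part of the Nutch or Solr APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ContentDedupSketch {

    // Hex-encoded MD5 digest of the page text (assumed signature function).
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Keep the first URL seen for each distinct content digest.
    // The input maps URL -> page text; iteration order decides which URL wins.
    public static List<String> dedupe(Map<String, String> pages) {
        Map<String, String> firstUrlByDigest = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            firstUrlByDigest.putIfAbsent(md5Hex(page.getValue()), page.getKey());
        }
        return new ArrayList<>(firstUrlByDigest.values());
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://www.atory.com/dupe_checker_pro/", "same page text");
        pages.put("http://www.atory.com/dupe_checker_PRO/", "same page text");
        System.out.println(dedupe(pages));
    }
}
```

Because the comparison is on content rather than on the URL, two URLs that differ only in case are merged only when they really serve the same page.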
Re: How to make the url id case insensitive?
Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <ma...@openindex.io> wrote:
> Hi,
>
> URI paths are case-sensitive. If you really want to treat all URLs as
> case-insensitive, I would suggest modifying the basic URL normalizer to
> lowercase all URLs so that they also end up lowercased in the CrawlDB.
>
> What is your problem? I would strongly suggest another solution if you're
> doing wide web crawls.
>
I don't want duplicate results where the only real difference is the case of
some letters in the URL.
What other solution?
>
> Cheers,
>
> > Hi,
> > I've just noticed that two search results of indexed data have the same
> > url:
> >
> > http://www.atory.com/dupe_checker_pro/
> > http://www.atory.com/dupe_checker_PRO/
> >
> > I thought the url/id was case-insensitively unique. Is there a way I can
> > set it up to be so?
> >
> > For Solr it makes sense that this is not the default, given its
> > disparate uses, but for Nutch it does not.
>
Re: How to make the url id case insensitive?
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
URI paths are case-sensitive. If you really want to treat all URLs as
case-insensitive, I would suggest modifying the basic URL normalizer to
lowercase all URLs so that they also end up lowercased in the CrawlDB.
What is your problem? I would strongly suggest another solution if you're
doing wide web crawls.
Cheers,
> Hi,
> I've just noticed that two search results of indexed data have the same
> url:
>
> http://www.atory.com/dupe_checker_pro/
> http://www.atory.com/dupe_checker_PRO/
>
> I thought the url/id was case-insensitively unique. Is there a way I can set
> it up to be so?
>
> For Solr it makes sense that this is not the default, given its disparate
> uses, but for Nutch it does not.
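
The lowercasing step suggested above could look roughly like the sketch below. It is plain Java and is not wired into Nutch's normalizer plugin interface. The class `LowercaseUrlSketch` and the method `lowercaseUrl` are hypothetical names, and the sketch assumes that lowercasing the entire URL string (scheme, host, path, and query) is acceptable.

```java
import java.util.Locale;

public class LowercaseUrlSketch {

    // Lowercase the whole URL string. Locale.ROOT avoids locale-specific
    // case rules (for example the Turkish dotless i).
    public static String lowercaseUrl(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both spellings from the thread map to the same key.
        System.out.println(lowercaseUrl("http://www.atory.com/dupe_checker_pro/"));
        System.out.println(lowercaseUrl("http://www.atory.com/dupe_checker_PRO/"));
    }
}
```

Both calls print `http://www.atory.com/dupe_checker_pro/`. The trade-off is that a server with case-sensitive paths may serve different pages under the two spellings, and those would then collapse into a single entry, which is a likely reason for the caution about wide web crawls.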