Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/09/05 07:18:09 UTC

How to make the url id case insensitive?

Hi,
I've just noticed that two search results from the indexed data have URLs that
differ only in case:

http://www.atory.com/dupe_checker_pro/
http://www.atory.com/dupe_checker_PRO/

I thought the url/id was unique case-insensitively. Is there a way to set it
up to be so?

For Solr it makes sense not to make this the default, given its disparate
uses, but for Nutch I would expect it to be.

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: How to make the url id case insensitive?

Posted by Markus Jelsma <ma...@openindex.io>.
Deduplication, either using Nutch or Solr.
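
To illustrate the idea, here is a minimal sketch of digest-based deduplication. It is not the actual Nutch or Solr implementation; the function name `deduplicate` and the (url, content) input shape are assumptions made for the example. Documents whose content hashes to the same digest are treated as duplicates, and only the first URL seen is kept.

```python
import hashlib


def deduplicate(docs):
    """Keep one URL per distinct content digest.

    `docs` is an iterable of (url, content) pairs (a hypothetical input
    shape, chosen for this sketch). The first URL seen for each digest wins.
    """
    seen = {}
    for url, content in docs:
        # Hash the page content, not the URL: two URLs that differ only
        # in case but serve the same page produce the same digest.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        seen.setdefault(digest, url)
    # Dicts preserve insertion order, so this is first-seen order.
    return list(seen.values())


docs = [
    ("http://www.atory.com/dupe_checker_pro/", "same page body"),
    ("http://www.atory.com/dupe_checker_PRO/", "same page body"),
    ("http://www.atory.com/other/", "different body"),
]
unique_urls = deduplicate(docs)
```

An exact digest only catches pages whose content is identical; pages that differ slightly (a timestamp, an ad block) would need a fuzzier signature.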


> On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> > 
> > URI paths are case-sensitive. If you really want to treat all URLs as
> > case-insensitive, I would suggest modifying the basic URL normalizer to
> > lowercase all URLs so that they also end up lowercased in the CrawlDB.
> > 
> > What is your problem? I would strongly suggest another solution if you're
> > doing wide web crawls.
> 
> I don't want duplicate results where the only real difference is the case
> of some letters in the URL.
> What other solution?
> 
> > Cheers,
> > 
> > > Hi,
> > > I've just noticed that two search results of indexed data have the same
> > > url:
> > > 
> > > http://www.atory.com/dupe_checker_pro/
> > > http://www.atory.com/dupe_checker_PRO/
> > > 
> > > I thought the url/id was case-insensitively unique. Is there a way I can
> > > set it up to be so?
> > > 
> > > For Solr it makes sense not to make it the default for disparate uses,
> > > but for nutch not.

Re: How to make the url id case insensitive?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi,
>
> URI paths are case-sensitive. If you really want to treat all URLs as
> case-insensitive, I would suggest modifying the basic URL normalizer to
> lowercase all URLs so that they also end up lowercased in the CrawlDB.
>
> What is your problem? I would strongly suggest another solution if you're
> doing wide web crawls.
>

I don't want duplicate results where the only real difference is the case of
some letters in the URL.
What other solution?


>
> Cheers,
>
> > Hi,
> > I've just noticed that two search results of indexed data have the same
> > url:
> >
> > http://www.atory.com/dupe_checker_pro/
> > http://www.atory.com/dupe_checker_PRO/
> >
> > I thought the url/id was case-insensitively unique. Is there a way I can
> > set it up to be so?
> >
> > For Solr it makes sense not to make it the default for disparate uses,
> > but for nutch not.
>




Re: How to make the url id case insensitive?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

URI paths are case-sensitive. If you really want to treat all URLs as
case-insensitive, I would suggest modifying the basic URL normalizer to
lowercase all URLs so that they also end up lowercased in the CrawlDB.
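
As a rough illustration of the lowercasing step only, here is a minimal standalone sketch. It is not Nutch's normalizer code; the function name `lowercase_url` is an assumption, and the choice to leave the query string untouched is a design decision for this example, not something Nutch prescribes.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_url(url: str) -> str:
    """Return `url` with its scheme, host, and path lowercased.

    Hypothetical helper for illustration. The query and fragment are left
    as-is because their values are often case-sensitive.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),
        parts.query,
        parts.fragment,
    ))


a = lowercase_url("http://www.atory.com/dupe_checker_pro/")
b = lowercase_url("http://www.atory.com/dupe_checker_PRO/")
# Both variants now map to the same key.
```

Because paths are case-sensitive, the lowercased form may not resolve on a server that distinguishes /Page from /page, so this only suits sites known to treat paths case-insensitively.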

What is your problem? I would strongly suggest another solution if you're 
doing wide web crawls.

Cheers,

> Hi,
> I've just noticed that two search results of indexed data have the same
> url:
> 
> http://www.atory.com/dupe_checker_pro/
> http://www.atory.com/dupe_checker_PRO/
> 
> I thought the url/id was case-insentively unique. Is there how I can set it
> up to be so?
> 
> For Solr it makes sense not to make it the default for disparate uses, but
> for nutch not.
