Posted to user@nutch.apache.org by Hans Benedict <be...@chemie.de> on 2005/06/27 09:38:42 UTC

dedup vs. session ids

Hi,

I am crawling some sites that use session IDs. As the crawler does not 
use cookies, the session IDs end up in the URL's query string. This 
results in thousands of pages that are duplicates based on the visible 
content, but are not detected as such, because the URLs contained in 
the HTML are different.

Has anybody found a solution to this problem? Is there a way to activate 
cookies for the crawler?

-- 
Kind regards,

Hans Benedict

_________________________________________________________________
Chemie.DE Information Service GmbH     Hans Benedict
Seydelstraße 28                        mailto: benedict@chemie.de
10117 Berlin, Germany                  Tel +49 30 204568-40
                                       Fax +49 30 204568-70

www.Chemie.DE               |          www.ChemieKarriere.NET   
www.Bionity.COM             |          www.BioKarriere.NET 


Re: dedup vs. session ids

Posted by Hans <be...@chemie.de>.
Andy Liu wrote:

> URL normalization occurs during parsing.  If your index isn't that
> big, it may be easier to start your crawl from scratch.

Can I do parsing without re-fetching, or is only the parsed data stored on disk?

Can I re-fetch only some servers while keeping the data of the other servers intact? (Only a handful of my servers use session IDs.)

Will the old pages with badly normalized URLs get overwritten by the new ones, or will I have to delete them manually?

Thanks for your help!

Regards,

Hans Benedict






Re: dedup vs. session ids

Posted by Andy Liu <an...@gmail.com>.
URL normalization occurs during parsing.  If your index isn't that
big, it may be easier to start your crawl from scratch.
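
If you do restart, the one-shot crawl tool is probably the quickest
route. Roughly along these lines (the urls file, output directory and
depth are just placeholders for your own setup, so double-check the
options against your Nutch version):

  # hypothetical fresh crawl with the 0.6/0.7-era one-shot tool;
  # the normalized URLs take effect as each segment is parsed
  bin/nutch crawl urls -dir crawl-fresh -depth 3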

On 6/29/05, Hans Benedict <be...@chemie.de> wrote:
> Juho, thanks, that was what I was looking for.
> 
> What I still don't understand: When is this URL-Normalization done? Or
> more precisely: What will I have to do with my already crawled pages?
> Reindex? Update the db? A simple dedup did not seem to do the job...

Re: dedup vs. session ids

Posted by Hans Benedict <be...@chemie.de>.
Juho, thanks, that was what I was looking for.

What I still don't understand: when is this URL normalization done? Or, 
more precisely: what will I have to do with my already crawled pages? 
Reindex? Update the db? A simple dedup did not seem to do the job...

Regards,

Hans Benedict

_________________________________________________________________
Chemie.DE Information Service GmbH     Hans Benedict
Seydelstraße 28                        mailto: benedict@chemie.de
10117 Berlin, Germany                  Tel +49 30 204568-40
                                       Fax +49 30 204568-70

www.Chemie.DE               |          www.ChemieKarriere.NET   
www.Bionity.COM             |          www.BioKarriere.NET 



Juho Mäkinen wrote:

>Take a look under conf/regex-normalize.xml

Re: dedup vs. session ids

Posted by Juho Mäkinen <ju...@gmail.com>.
Take a look at conf/regex-normalize.xml.

I don't know the details of how it works, but it seems to do just what
you need: removing session data from GET URLs. By default it is
configured to remove PHPSESSID variables, but you should easily be able
to figure out how to customize it for your needs.
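
For illustration, each rule in that file is just a pattern/substitution
pair. Something along these lines might work for a site that passes a
"sid" parameter (untested, and "sid" is only a placeholder; adapt the
pattern to the parameter names your sites actually use):

  <regex-normalize>
    <!-- illustrative rule, not the shipped default: drop a "sid" query
         parameter and its value, keeping the ? or & that precedes it -->
    <regex>
      <pattern>([?&amp;])sid=[^&amp;]*&amp;?</pattern>
      <substitution>$1</substitution>
    </regex>
  </regex-normalize>

You would add the <regex> block to the existing file rather than
replace the whole thing, since the shipped file already contains rules
(the PHPSESSID one among them).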

 - Juho Mäkinen, http://www.juhonkoti.net
