Posted to user@nutch.apache.org by Hans Benedict <be...@chemie.de> on 2005/06/27 09:38:42 UTC
dedup vs. session ids
Hi,
I am crawling some sites that use session ids. Because the crawler does
not use cookies, the session ids are put in the URL's query string. This
results in thousands of pages that are duplicates based on their visible
content, but are not detected as such, because the URLs contained in the
HTML are different.
Has anybody found a solution to this problem? Is there a way to activate
cookies for the crawler?
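[Editorial note: the effect described above can be sketched outside Nutch. This is illustrative Python with made-up URLs, showing how stripping a session parameter collapses the per-session variants of one page into a single canonical URL:]

```python
import re

# Three fetches of the same page, each carrying a different session id
# in the query string (hypothetical URLs, for illustration only).
urls = [
    "http://example.com/page?id=7&PHPSESSID=a1b2c3",
    "http://example.com/page?id=7&PHPSESSID=d4e5f6",
    "http://example.com/page?PHPSESSID=g7h8i9&id=7",
]

def normalize(url):
    """Drop the PHPSESSID parameter, then tidy up leftover separators."""
    url = re.sub(r"PHPSESSID=[^&#]*&?", "", url)
    return url.rstrip("?&")

# All three variants normalize to the same URL.
print({normalize(u) for u in urls})  # a set with a single element
```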
--
Kind regards,
Hans Benedict
_________________________________________________________________
Chemie.DE Information Service GmbH, Hans Benedict
Seydelstraße 28, mailto:benedict@chemie.de
10117 Berlin, Germany, Tel +49 30 204568-40, Fax +49 30 204568-70
www.Chemie.DE | www.ChemieKarriere.NET
www.Bionity.COM | www.BioKarriere.NET
Re: dedup vs. session ids
Posted by Hans <be...@chemie.de>.
Andy Liu wrote:
> URL normalization occurs during parsing. If your index isn't that
> big, it may be easier to start your crawl from scratch.
Can I re-parse without re-fetching, or is only the parsed data stored on disk?
Can I re-fetch only some servers while keeping the data for the other servers intact? (Only a handful of my servers use session ids.)
Will the old pages with badly normalized URLs be overwritten by the new ones, or will I have to delete them manually?
Thanks for your help!
Regards,
Hans Benedict
--
Hans
Re: dedup vs. session ids
Posted by Andy Liu <an...@gmail.com>.
URL normalization occurs during parsing. If your index isn't that
big, it may be easier to start your crawl from scratch.
On 6/29/05, Hans Benedict <be...@chemie.de> wrote:
> Juho, thanks, that was what I was looking for.
>
> What I still don't understand: When is this URL-Normalization done? Or
> more precisely: What will I have to do with my already crawled pages?
> Reindex? Update the db? A simple dedup did not seem to do the job...
>
> Regards,
>
> Hans Benedict
Re: dedup vs. session ids
Posted by Hans Benedict <be...@chemie.de>.
Juho, thanks, that was what I was looking for.
What I still don't understand: when is this URL normalization done? Or,
more precisely: what will I have to do with my already crawled pages?
Reindex? Update the db? A simple dedup did not seem to do the job...
Regards,
Hans Benedict
Juho Mäkinen wrote:
>Take a look under conf/regex-normalize.xml
>
>I don't know how it works, but it seems to do just what you need,
>removing session data from GET URLs. It's been configured to
>remove PHPSESSID variables by default, but you should easily be
>able to figure out how to customize it for your needs.
>
> - Juho Mäkinen, http://www.juhonkoti.net
Re: dedup vs. session ids
Posted by Juho Mäkinen <ju...@gmail.com>.
Take a look under conf/regex-normalize.xml
I don't know how it works, but it seems to do just what you need,
removing session data from GET URLs. It's been configured to
remove PHPSESSID variables by default, but you should easily be
able to figure out how to customize it for your needs.
- Juho Mäkinen, http://www.juhonkoti.net
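[Editorial note: entries in conf/regex-normalize.xml are pattern/substitution pairs, mirroring the stock PHPSESSID rule Juho mentions. A minimal sketch of an additional rule, assuming a hypothetical session parameter named "sid" — adjust the parameter name to match the URLs your sites actually produce:]

```xml
<!-- Sketch of an extra rule for conf/regex-normalize.xml.
     Assumes a hypothetical session parameter named "sid";
     note that "&" must be written as "&amp;" inside the XML. -->
<regex>
  <!-- remove "sid=..." together with a trailing "&", if any -->
  <pattern>sid=[^&amp;#]*&amp;?</pattern>
  <substitution></substitution>
</regex>
```

After such a rule, a URL like `http://host/page?sid=abc123&id=7` would be normalized to `http://host/page?id=7` (a trailing `?` or `&` left behind when the session id is the last parameter may need a second cleanup rule).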