Posted to user@nutch.apache.org by David Wallace <da...@nzqa.govt.nz> on 2006/09/07 23:12:44 UTC

Re: Recrawling (Tomi NA)

Just guessing, but could this be caused by session ids in the URL?  Or
some other unimportant piece of data?  If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a different session id.  If this is
what's causing your problem, then you need to use the regexp URL
normaliser to strip out the session ids.
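
To illustrate the idea, here is a minimal Python sketch of what such a
normalisation rule does. It is not Nutch's own configuration, and the
parameter names (jsessionid, PHPSESSID, sid, sessionid) are assumptions;
substitute whatever your site actually appends.

```python
import re

# Hypothetical session-id parameter names; adjust to what the site emits.
_QUERY_SESSION = re.compile(
    r"([?&])(?:jsessionid|phpsessid|sid|sessionid)=[^&#]*&?",
    re.IGNORECASE,
)
# Java-style session id carried in the path, e.g. /page;jsessionid=ABC
_PATH_SESSION = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_session_id(url):
    """Return url with the session-id parameter removed, if present."""
    url = _PATH_SESSION.sub("", url)
    # Keep the leading separator so the remaining parameters stay attached.
    url = _QUERY_SESSION.sub(r"\1", url)
    # Drop a '?' or '&' left dangling at the end of the query string.
    return re.sub(r"[?&](?=#|$)", "", url)
```

Two URLs that differ only in their session id then collapse to the same
string, so the page is stored once rather than once per crawl.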
 
Regards,
David.

On 7/9/2006 11:45:03 +0200, Tomi NA hefest@gmail.com wrote:
>
> On 9/6/06, Andrei Hajdukewycz <ah...@mozilla.com> wrote:
>> Another problem I've noticed is that it seems the db grows *rapidly* with
>> each successive recrawl. Mine started at 379MB, and it seems to increase
>> by roughly 350MB every time I run a recrawl, despite there not being
>> anywhere near that many additional pages.
>>
>> This seems like a pretty severe problem, honestly, obviously there's
>> a lot of duplicated data in the segments.
>
> I have the same problem: my index grew from 1.5GB after the original
> crawl to over 5GB(!) after the recrawl...from the looks of it, I might
> as well crawl anew every time. :\
>
> t.n.a.




Re: Recrawling (Tomi NA)

Posted by Tomi NA <he...@gmail.com>.
On 9/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Tomi NA wrote:
> > On 9/7/06, David Wallace <da...@nzqa.govt.nz> wrote:
> >> Just guessing, but could this be caused by session ids in the URL?  Or
> >> some other unimportant piece of data?  If this is the case, then every
> >> page would be added to the index when it's crawled, regardless of
> >> whether it's already in there, with a different session id.  If this is
> >> what's causing your problem, then you need to use the regexp URL
> >> normaliser to strip out the session ids.
> >
> > Nice try but no luck, I'm afraid.
> > The complete web is absolutely static. The reason is that we've set up
> > IIS (I'm not too happy choosing IIS over apache) to serve files from a
> > shared directory on the same server, the rationale being that we'd
> > rather have http://-type links than file://.
> > From what I've seen in the logs, I don't see URLs varying so I'm still
> > at square one. Still, thanks for the effort. If you have any other
> > ideas, I'm eager to hear them.
>
> The best way to discover what's going on is to start from a small subset
> of injected urls, and do the following:
>
> * inject
>
> * dump the db to a text file
>
> * generate / fetch / updatedb
>
> * dump the db again to a second text file
>
> * compare the files.

I'll see if I'm able to reproduce those steps here, thanks.

t.n.a.

Re: Recrawling (Tomi NA)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Tomi NA wrote:
> On 9/7/06, David Wallace <da...@nzqa.govt.nz> wrote:
>> Just guessing, but could this be caused by session ids in the URL?  Or
>> some other unimportant piece of data?  If this is the case, then every
>> page would be added to the index when it's crawled, regardless of
>> whether it's already in there, with a different session id.  If this is
>> what's causing your problem, then you need to use the regexp URL
>> normaliser to strip out the session ids.
>
> Nice try but no luck, I'm afraid.
> The complete web is absolutely static. The reason is that we've set up
> IIS (I'm not too happy choosing IIS over apache) to serve files from a
> shared directory on the same server, the rationale being that we'd
> rather have http://-type links than file://.
> From what I've seen in the logs, I don't see URLs varying so I'm still
> at square one. Still, thanks for the effort. If you have any other
> ideas, I'm eager to hear them.

The best way to discover what's going on is to start from a small subset 
of injected urls, and do the following:

* inject

* dump the db to a text file

* generate / fetch / updatedb

* dump the db again to a second text file

* compare the files.
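
The last step can be sketched in a few lines of Python. This is only an
illustration: it assumes each record in the text dump contains its URL
and treats any whitespace-delimited token starting with http:// or
https:// as a URL. The function names are made up for this example.

```python
import re

# Assumption: a URL is any run of non-whitespace characters that starts
# with http:// or https://; the rest of each dump record is ignored.
_URL = re.compile(r"https?://\S+")


def urls_in_dump(text):
    """Collect the distinct URLs mentioned in one db dump."""
    return set(_URL.findall(text))


def compare_dumps(before, after):
    """Report which URLs were added, removed, or kept between two dumps."""
    old, new = urls_in_dump(before), urls_in_dump(after)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "unchanged": sorted(old & new),
    }
```

Read the two dump files into strings and pass them to compare_dumps().
If "added" holds near-duplicates of URLs already listed under
"unchanged", the growth comes from URL variants rather than new pages.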

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Recrawling (Tomi NA)

Posted by Tomi NA <he...@gmail.com>.
On 9/7/06, David Wallace <da...@nzqa.govt.nz> wrote:
> Just guessing, but could this be caused by session ids in the URL?  Or
> some other unimportant piece of data?  If this is the case, then every
> page would be added to the index when it's crawled, regardless of
> whether it's already in there, with a different session id.  If this is
> what's causing your problem, then you need to use the regexp URL
> normaliser to strip out the session ids.

Nice try but no luck, I'm afraid.
The complete web is absolutely static. The reason is that we've set up
IIS (I'm not too happy choosing IIS over apache) to serve files from a
shared directory on the same server, the rationale being that we'd
rather have http://-type links than file://.
From what I've seen in the logs, I don't see URLs varying so I'm still
at square one. Still, thanks for the effort. If you have any other
ideas, I'm eager to hear them.

t.n.a.