Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/07/25 15:06:52 UTC

Disk usage for big crawl

Hello list,

In order to crawl around 100,000 documents, how much disk usage/table 
space will be needed for PostgreSQL? Our database administrators are now 
asking. Rather than starting this crawl (which will take a lot of time) 
and trying to measure the usage manually, I hope we could get an answer 
from the list members instead.

And will the table space increase significantly with every recrawl?

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Disk usage for big crawl

Posted by Erlend Garåsen <e....@usit.uio.no>.
Thanks, I will send your recommendations to the database administrators. 
They are responsible for setting up maintenance strategies for the 
PostgreSQL databases running at the university.

BTW, I started crawling all the web pages yesterday, and the job will 
probably finish later today. Then I can ask the database administrators 
to check the size of the tables (once they have performed a vacuum). 
I'm not sure whether it's recommended to run a new crawl after they have 
checked the disk usage, in order to find out whether a recrawl will 
increase the table space significantly. I don't think it will, but we 
need to inform them in case the size grows significantly after a month 
or two.
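
In case it's useful, something along these lines (an untested sketch; the 
database name, credentials and schema are just placeholders) is roughly 
what I will suggest they run to list the largest tables:

import psycopg2

# Placeholders: adjust database name, user, password and host to the local setup.
conn = psycopg2.connect(dbname="manifoldcf", user="manifoldcf",
                        password="secret", host="localhost")
cur = conn.cursor()
# Total on-disk size (including indexes and TOAST) of each ordinary table,
# assuming the ManifoldCF tables live in the public schema.
cur.execute("""
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public' AND c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
""")
for table, size in cur.fetchall():
    print("%s  %s" % (table, size))
cur.close()
conn.close()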

Erlend

On 25.07.11 15.16, Karl Wright wrote:
> Hi Erlend,
>
> I can't answer for how PostgreSQL allocates space on the whole - the
> PostgreSQL documentation may tell you more.  I can say this much:
>
> (1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
> This implies that the table space grows until the vacuuming operation
> takes place.
> (2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
> process (which runs in background) was insufficient to keep up with
> ManifoldCF going at full tilt in a web crawl.
> (3) The solution at MetaCarta was to periodically run "maintenance",
> which involves running a VACUUM FULL operation on the database.  This
> will cause the crawl to stall while the vacuum operation is going,
> since a new (compact) disk image of every table must be made, and thus
> each table is locked for a period of time.
>
> So my suggestion is to adopt a maintenance strategy first, make sure
> it is working for you, and then calculate how much disk space you will
> need for that strategy.  Typically maintenance might be done once or
> twice a week.  Under heavy crawling (lots and lots of hosts being
> crawled), you might do maintenance once every 2 days or so.
>
> Karl
>
>
> On Mon, Jul 25, 2011 at 9:06 AM, Erlend Garåsen<e....@usit.uio.no>  wrote:
>>
>> Hello list,
>>
>> In order to crawl around 100,000 documents, how much disk usage/table space
>> will be needed for PostgreSQL? Our database administrators are now asking.
>> Rather than starting this crawl (which will take a lot of time) and trying to
>> measure the usage manually, I hope we could get an answer from the list
>> members instead.
>>
>> And will the table space increase significantly with every recrawl?
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Disk usage for big crawl

Posted by Erlend Garåsen <e....@usit.uio.no>.
Thanks for your suggestions, Farzad!

As you probably read in my previous post, I started a full crawl 
yesterday. Instead of the Null Output connector, I used our own request 
handler, which is more advanced than the regular ExtractingRequestHandler 
(Solr Cell). We can configure our handler to skip posting the data to 
Solr and instead dump the content to disk. This makes it unnecessary to 
do a recrawl if we just want to create a new Solr index for testing 
purposes. Reading the data from the generated file is much faster.
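
To illustrate the idea (a rough sketch only, not our actual handler; the 
dump file name, format and Solr URL below are made up): the handler writes 
one JSON document per line to disk during the crawl, and a small script 
can later replay the dump into a fresh Solr index:

import json
import requests

# Hypothetical dump written by the handler: one JSON document per line.
DUMP_FILE = "crawl-dump.jsonl"
# Adjust to the actual Solr core; the JSON update handler accepts an array
# of documents.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"

with open(DUMP_FILE) as dump:
    docs = [json.loads(line) for line in dump if line.strip()]

# Post the whole batch and commit in one request.
response = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "true"})
response.raise_for_status()
print("Indexed %d documents" % len(docs))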

Erlend

On 25.07.11 21.29, Farzad Valad wrote:
> You can also do some smaller tests and project a number to satisfy your
> db admins. Perform a few small crawls, say 100, 500, and 1000 documents,
> and estimate a growth rate. The other thing you can do is a full crawl
> with the Null Output connector. Depending on your system you can get
> speeds of up to 60 docs a second; even at half that speed the crawl will
> finish in less than an hour, and you'll at least know one half of the
> requirement for that set, namely what the input crawl needs. Depending
> on the output connector, you may or may not have additional growing
> storage needs. You can use both of these techniques to get closer to a
> reasonable guesstimate : )
>
> On 7/25/2011 8:16 AM, Karl Wright wrote:
>> Hi Erlend,
>>
>> I can't answer for how PostgreSQL allocates space on the whole - the
>> PostgreSQL documentation may tell you more. I can say this much:
>>
>> (1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
>> This implies that the table space grows until the vacuuming operation
>> takes place.
>> (2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
>> process (which runs in background) was insufficient to keep up with
>> ManifoldCF going at full tilt in a web crawl.
>> (3) The solution at MetaCarta was to periodically run "maintenance",
>> which involves running a VACUUM FULL operation on the database. This
>> will cause the crawl to stall while the vacuum operation is going,
>> since a new (compact) disk image of every table must be made, and thus
>> each table is locked for a period of time.
>>
>> So my suggestion is to adopt a maintenance strategy first, make sure
>> it is working for you, and then calculate how much disk space you will
>> need for that strategy. Typically maintenance might be done once or
>> twice a week. Under heavy crawling (lots and lots of hosts being
>> crawled), you might do maintenance once every 2 days or so.
>>
>> Karl
>>
>>
>> On Mon, Jul 25, 2011 at 9:06 AM, Erlend
>> Garåsen<e....@usit.uio.no> wrote:
>>> Hello list,
>>>
>>> In order to crawl around 100,000 documents, how much disk usage/table
>>> space will be needed for PostgreSQL? Our database administrators are now
>>> asking. Rather than starting this crawl (which will take a lot of time)
>>> and trying to measure the usage manually, I hope we could get an answer
>>> from the list members instead.
>>>
>>> And will the table space increase significantly with every recrawl?
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>> 31050
>>>
>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Disk usage for big crawl

Posted by Farzad Valad <ho...@farzad.net>.
You can also do some smaller tests and project a number to satisfy your 
db admins.  Perform a few small crawls, say 100, 500, and 1000 documents, 
and estimate a growth rate.  The other thing you can do is a full crawl 
with the Null Output connector.  Depending on your system you can get 
speeds of up to 60 docs a second; even at half that speed the crawl will 
finish in less than an hour, and you'll at least know one half of the 
requirement for that set, namely what the input crawl needs.  Depending 
on the output connector, you may or may not have additional growing 
storage needs.  You can use both of these techniques to get closer to a 
reasonable guesstimate : )
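
Something like this back-of-the-envelope calculation would give a first 
estimate (the sizes below are placeholders; fill in real numbers from 
pg_database_size or pg_total_relation_size after each test crawl):

# Back-of-the-envelope extrapolation from a few small test crawls.
# The measured sizes are placeholders; replace them with real measurements.
measurements = [          # (documents crawled, database size in MB)
    (100, 12.0),
    (500, 28.0),
    (1000, 49.0),
]

# Least-squares fit of size = base + per_doc * docs.
n = len(measurements)
sum_x = sum(d for d, _ in measurements)
sum_y = sum(s for _, s in measurements)
sum_xy = sum(d * s for d, s in measurements)
sum_xx = sum(d * d for d, _ in measurements)
per_doc = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
base = (sum_y - per_doc * sum_x) / n

target = 100000
print("Estimated size for %d docs: %.0f MB" % (target, base + per_doc * target))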

On 7/25/2011 8:16 AM, Karl Wright wrote:
> Hi Erlend,
>
> I can't answer for how PostgreSQL allocates space on the whole - the
> PostgreSQL documentation may tell you more.  I can say this much:
>
> (1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
> This implies that the table space grows until the vacuuming operation
> takes place.
> (2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
> process (which runs in background) was insufficient to keep up with
> ManifoldCF going at full tilt in a web crawl.
> (3) The solution at MetaCarta was to periodically run "maintenance",
> which involves running a VACUUM FULL operation on the database.  This
> will cause the crawl to stall while the vacuum operation is going,
> since a new (compact) disk image of every table must be made, and thus
> each table is locked for a period of time.
>
> So my suggestion is to adopt a maintenance strategy first, make sure
> it is working for you, and then calculate how much disk space you will
> need for that strategy.  Typically maintenance might be done once or
> twice a week.  Under heavy crawling (lots and lots of hosts being
> crawled), you might do maintenance once every 2 days or so.
>
> Karl
>
>
> On Mon, Jul 25, 2011 at 9:06 AM, Erlend Garåsen<e....@usit.uio.no>  wrote:
>> Hello list,
>>
>> In order to crawl around 100,000 documents, how much disk usage/table space
>> will be needed for PostgreSQL? Our database administrators are now asking.
>> Rather than starting this crawl (which will take a lot of time) and trying to
>> measure the usage manually, I hope we could get an answer from the list
>> members instead.
>>
>> And will the table space increase significantly with every recrawl?
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>


Re: Disk usage for big crawl

Posted by Karl Wright <da...@gmail.com>.
Hi Erlend,

I can't answer for how PostgreSQL allocates space on the whole - the
PostgreSQL documentation may tell you more.  I can say this much:

(1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
This implies that the table space grows until the vacuuming operation
takes place.
(2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
process (which runs in background) was insufficient to keep up with
ManifoldCF going at full tilt in a web crawl.
(3) The solution at MetaCarta was to periodically run "maintenance",
which involves running a VACUUM FULL operation on the database.  This
will cause the crawl to stall while the vacuum operation is going,
since a new (compact) disk image of every table must be made, and thus
each table is locked for a period of time.

So my suggestion is to adopt a maintenance strategy first, make sure
it is working for you, and then calculate how much disk space you will
need for that strategy.  Typically maintenance might be done once or
twice a week.  Under heavy crawling (lots and lots of hosts being
crawled), you might do maintenance once every 2 days or so.
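
To make the maintenance step concrete, here is a minimal sketch of such a 
pass (connection details are placeholders, and this is only an 
illustration, not the official ManifoldCF maintenance procedure):

import psycopg2

# Placeholders: point this at the ManifoldCF database.
conn = psycopg2.connect(dbname="manifoldcf", user="manifoldcf",
                        password="secret", host="localhost")
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Rewrites each table into a compact form and reclaims dead-tuple space;
# tables are locked while this runs, so the crawl will stall until it is done.
cur.execute("VACUUM FULL")

cur.close()
conn.close()

Run it via cron or a similar scheduler at whatever interval your 
maintenance strategy calls for.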

Karl


On Mon, Jul 25, 2011 at 9:06 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>
> Hello list,
>
> In order to crawl around 100,000 documents, how much disk usage/table space
> will be needed for PostgreSQL? Our database administrators are now asking.
> Rather than starting this crawl (which will take a lot of time) and trying to
> measure the usage manually, I hope we could get an answer from the list
> members instead.
>
> And will the table space increase significantly with every recrawl?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>