Posted to modperl@perl.apache.org by Ernest Lergon <er...@virtualitas.net> on 2002/05/01 21:31:42 UTC

Re: Memory explodes loading CSV into hash

Hi Stas,

having a look at Apache::Status and playing around with your tips on

http://www.apacheweek.com/features/mod_perl11

I found some interesting results and a compromise solution:

In a module I loaded a CSV file as class data into different structures
and compared the output of Apache::Status with top.

Enclosed you'll find a test report.

The code below 'building' shows how the lines are put into the
structures.

The lines below 'perl-status' show the output of Apache::Status.
The line below 'top' shows the output of top.

Examples for the tested structures are:

$buffer = '1\tr1v1\tr1v2\tr1v3\n2\tr2v1\tr2v2\tr2v3\n' ...

@lines = (
        '1\tr1v1\tr1v2\tr1v3',
        '2\tr2v1\tr2v2\tr2v3',
        ... )

%data = (
        1 => [ 1, 'r1v1' , 'r1v2' , 'r1v3' ],
        2 => [ 2, 'r2v1' , 'r2v2' , 'r2v3' ],
        ... )

$pack = {
        1 => [ 1, 'r1v1' , 'r1v2' , 'r1v3' ],
        2 => [ 2, 'r2v1' , 'r2v2' , 'r2v3' ],
        ... }

%index = (
        1 => '1\tr1v1\tr1v2\tr1v3',
        2 => '2\tr2v1\tr2v2\tr2v3',
        ... )

One thing I realized using Devel::Peek is that, in a hash of
array-refs, each item in the array is a full-blown Perl scalar with its
own flags etc. That seems to be the reason for the 'memory explosion'.
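
The per-scalar overhead is easy to see with Devel::Peek. A small
standalone sketch (the $row variable is illustrative, not from my
module):

```perl
use strict;
use warnings;
use Devel::Peek;

# Every element of the array ref is a complete SV, each carrying its
# own flags, reference count and string buffer.  Dump() makes that
# per-scalar overhead visible (the report goes to STDERR):
my $row = [ 1, 'r1v1', 'r1v2', 'r1v3' ];
Dump( $row->[1] );
```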

Another thing I found is that Apache::Status does not always seem to
report complete values. Therefore I recorded the sizes from top, too.

Especially for the hash of array-refs (%data) and the hash-ref of
array-refs ($pack), perl-status reports only part of the memory used:
for $pack only the pointer (16 bytes), for %data only the keys (?).

As a compromise I'll use the %index structure. It is small enough while
providing fast access. A further optimization will be to remove the
redundant key field from each line.
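
A minimal sketch of that plan (sample records stand in for the real CSV
filehandle; variable names are illustrative):

```perl
use strict;
use warnings;

# Build the %index structure: numeric key => raw record line, with
# the redundant key field stripped off before storing.
my @csv_lines = ( "1\tr1v1\tr1v2\tr1v3", "2\tr2v1\tr2v2\tr2v3" );

my %index;
for my $line (@csv_lines) {
    my ( $key, $rest ) = split /\t/, $line, 2;
    $index{ 0 + $key } = $rest;    # one scalar per record, no key copy
}

# Fields are recovered by splitting on demand, so only one SV per
# record stays resident instead of one SV per field:
my @record = split /\t/, $index{2};
print "$record[0]\n";    # prints "r2v1"
```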

Success: A reduction from 26 MB to 7 MB - what I estimated in my first
mail.

A last word from perldebguts.pod:

|| Perl is a profligate wastrel when it comes to memory use.  There is a
|| saying that to estimate memory usage of Perl, assume a reasonable
|| algorithm for memory allocation, multiply that estimate by 10, and
|| while you still may miss the mark, at least you won't be quite so
|| astonished.  This is not absolutely true, but may provide a good grasp
|| of what happens.
||
|| [...]
||
|| Anecdotal estimates of source-to-compiled code bloat suggest an
|| eightfold increase.

Perhaps my experiences could be added to the long list of anecdotes ;-))

Thank you all again for escorting me on this deep dive.

Ernest

--

*********************************************************************
* VIRTUALITAS Inc.               *                                  *
*                                *                                  *
* European Consultant Office     *      http://www.virtualitas.net  *
* Internationales Handelszentrum *   contact:Ernest Lergon          *
* Friedrichstraße 95             *    mailto:Ernest@virtualitas.net *
* 10117 Berlin / Germany         *       ums:+49180528132130266     *
*********************************************************************
       PGP-Key http://www.virtualitas.net/Ernest_Lergon.asc




TEST REPORT
===========

CSV file:
        14350 records
        CSV     2151045 bytes = 2101 Kbytes
        CSV_2   2136695 bytes = 2086 Kbytes (w/o CR)

1       all empty
=================

building:
        none

perl-status:
        *buffer{SCALAR}           25 bytes
        *lines{ARRAY}             56 bytes
        *data{HASH}              228 bytes
        *pack{SCALAR}             16 bytes
        *index{HASH}             228 bytes

top:
        12992  12M 12844   base

2       buffer
==============

building:
        $buffer .= $_ . "\n";

perl-status:
        *buffer{SCALAR}      2151069 bytes = CSV + 24 bytes
        *lines{ARRAY}             56 bytes
        *data{HASH}              228 bytes
        *pack{SCALAR}             16 bytes
        *index{HASH}             228 bytes

top:
        17200  16M 17040   base + 4208 Kbytes = CSV + 2107 KBytes

3       lines
=============

building:
        push @lines, $_;

perl-status:
        *buffer{SCALAR}           25 bytes
        *lines{ARRAY}        2519860 bytes = CSV_2 + 383165 bytes
                                             (approx. 27 * 14350 )
        *data{HASH}              228 bytes
        *pack{SCALAR}             16 bytes
        *index{HASH}             228 bytes

top:
        18220  17M 18076   base + 5228 Kbytes = CSV_2 + 3142 Kbytes

4       data
============

building:
        @record = split ( "\t", $_ );
        $key = 0 + $record[0];
        $data{$key} = [ @record ];

perl-status:
        *buffer{SCALAR}           25 bytes
        *lines{ARRAY}             56 bytes
        *data{HASH}           723302 bytes = approx. 50 * 14350 ( key + ref )
                                             (where is the data?)
        *pack{SCALAR}             16 bytes
        *index{HASH}             228 bytes

top:
        40488  38M 39208   base + 27566 Kbytes = CSV_2 + 25480 Kbytes (!)

5       pack
============

building:
        @record = split ( "\t", $_ );
        $key = 0 + $record[0];
        $pack->{$key} = [ @record ];

perl-status:
        *buffer{SCALAR}           25 bytes
        *lines{ARRAY}             56 bytes
        *data{HASH}              228 bytes
        *pack{SCALAR}             16 bytes (where is the data?)
        *index{HASH}             228 bytes

top:
        40492  39M 40340   base + 27570 Kbytes = CSV_2 + 25484 Kbytes (!)

6       index
=============

building:
        @record = split ( "\t", $_ );
        $key = 0 + $record[0];
        $index->{$key} = $_;            # !!!

perl-status:
        *buffer{SCALAR}           25 bytes
        *lines{ARRAY}             56 bytes
        *data{HASH}              228 bytes
        *pack{SCALAR}             16 bytes
        *index{HASH}         2989146 bytes = CSV_2 + 852448 bytes
                                             ( approx. 59 * 14350 )

top:
        19988  19M 19824   base + 6996 Kbytes = CSV_2 + 4910 Kbytes

EOF


Re: Memory explodes loading CSV into hash

Posted by Ernest Lergon <er...@virtualitas.net>.
Stas Bekman wrote:
> 
> Ideally when such a
> situation happens, and you must load all the data into memory, which
> is in short supply, your best bet is to rewrite the data storage layer
> in XS/C and use a tie interface to make it transparent to your Perl
> code. So you will still use the hash, but the refs to arrays will
> actually be C arrays.
> 
Sorry, I'm not familiar with C(hinese) - but if someone could develop an
XS/Pascal interface ;-))

> Ernest Lergon wrote:
>
> > Another thing I found is that Apache::Status does not always seem to
> > report complete values. Therefore I recorded the sizes from top, too.
> 
> Were you running a single process? If you aren't, Apache::Status could
> have shown you a different process.
> 
Running httpd -X shows the same results.

I will use the %index structure mentioned above for now. Thanks to
modular OO Perl I can re-code my data package later, if the "memory
explosion" hits me again ;-))

Ernest




Re: Memory explodes loading CSV into hash

Posted by Stas Bekman <st...@stason.org>.
Ernest Lergon wrote:

> having a look at Apache::Status and playing around with your tips on
> 
> http://www.apacheweek.com/features/mod_perl11
> 
> I found some interesting results and a compromising solution:

Glad to hear that Apache::Status was of help to you. Ideally when such a 
situation happens, and you must load all the data into memory, which is 
in short supply, your best bet is to rewrite the data storage layer in 
XS/C and use a tie interface to make it transparent to your Perl code. 
So you will still use the hash, but the refs to arrays will actually be 
C arrays.
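
A pure-Perl sketch of that tie interface (the packed string per record
here merely stands in for a real C array; the memory saving only comes
once the backing store is implemented in XS/C):

```perl
package PackedRecords;
use strict;
use warnings;

# Records are kept packed in one string each and only unpacked on
# access.  A full tie class would also implement DELETE, FIRSTKEY,
# NEXTKEY and friends; only the methods used below are sketched.
sub TIEHASH { my ($class) = @_; bless { store => {} }, $class }
sub STORE {
    my ( $self, $key, $fields ) = @_;
    $self->{store}{$key} = join "\t", @$fields;
}
sub FETCH {
    my ( $self, $key ) = @_;
    [ split /\t/, $self->{store}{$key} ];
}
sub EXISTS { my ( $self, $key ) = @_; exists $self->{store}{$key} }

package main;
tie my %data, 'PackedRecords';
$data{1} = [ 1, 'r1v1', 'r1v2', 'r1v3' ];    # looks like a normal hash
my $fields = $data{1};                       # unpacked on demand
```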

> Another thing I found is that Apache::Status does not always seem to
> report complete values. Therefore I recorded the sizes from top, too.

Were you running a single process? If you aren't, Apache::Status could 
have shown you a different process. Also you can use GTop, if you have 
libgtop on your system, which gives you a Perl interface to the 
process's memory usage. See the guide for many examples.

> Success: A reduction from 26 MB to 7 MB - what I estimated in my first
> mail.

:)
__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com