Posted to dev@arrow.apache.org by Daniel Harper <dj...@gmail.com> on 2018/12/04 21:22:53 UTC

Re: [Go] High memory usage on CSV read into table

Sorry I've been away at re:Invent.

Just tried out what's currently on master (with the chunked change that
looks like it has been merged). I'll do the breakdown of the different parts
later, but as a high-level look at just running the same script as described
above, these are the numbers:

https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing

Looks to me like the change has definitely helped, with memory usage
dropping to around 300 MB, although the usage doesn't really change that
much once the chunk size is > 1000.




Daniel Harper
http://djhworld.github.io


On Fri, 23 Nov 2018 at 10:58, Sebastien Binet <bi...@cern.ch> wrote:

> On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <we...@gmail.com> wrote:
>
> > That seems buggy then. There is only 4.125 bytes of overhead per
> > string value on average (a 32-bit offset, plus a valid bit)
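(That is, a 4-byte 32-bit offset per value plus one validity bit, i.e. 1/8 of a
byte, giving 4 + 0.125 = 4.125 bytes of fixed overhead per string; the string
bytes themselves are stored on top of that.)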
> > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <dj...@gmail.com>
> > wrote:
> > >
> > > Uncompressed
> > >
> > > $ ls -la concurrent_streams.csv
> > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > >
> > > $ wc -l concurrent_streams.csv
> > >  1007481 concurrent_streams.csv
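(Roughly 112 MiB across ~1,007,481 rows, i.e. on the order of 115-120 bytes per
row on disk.)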
> > >
> > >
> > > Daniel Harper
> > > http://djhworld.github.io
> > >
> > >
> > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > strings in memory. Is it compressed?
> > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <dj...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Thanks,
> > > > >
> > > > > I've tried the new code and that seems to have shaved about 1GB of
> > memory
> > > > > off, so the heap is about 8.84GB now, here is the updated pprof
> > output
> > > > > https://i.imgur.com/itOHqBf.png
> > > > >
> > > > > It looks like the majority of allocations are in the
> > memory.GoAllocator
> > > > >
> > > > > (pprof) top
> > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > Showing top 10 nodes out of 41
> > > > >       flat  flat%   sum%        cum   cum%
> > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%
> > > > > github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%
> > > > > github.com/apache/arrow/go/arrow/memory.NewResizableBuffer
> (inline)
> > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%
> > > > > github.com/apache/arrow/go/arrow/array.NewData
> > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%
> > > > > github.com/apache/arrow/go/arrow/array.NewStringData
> > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%
> > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%
> > > > > github.com/apache/arrow/go/arrow/array.NewChunked
> > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%
> > > > > github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > >     0.01GB  0.15%   100%     0.21GB  2.37%
> > > > > github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > >          0     0%   100%        6GB 67.91%
> > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > >          0     0%   100%     4.03GB 45.54%
> > > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
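(For reference, a minimal, self-contained sketch of how a heap profile like the
listing above can be captured with the standard runtime/pprof package; the
string-building loop is only a stand-in for the real CSV-to-Table code.)

    package main

    import (
        "os"
        "runtime"
        "runtime/pprof"

        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        mem := memory.NewGoAllocator()

        // Stand-in workload exercising the BinaryBuilder path that dominates
        // the profile above; replace with the real CSV -> Table code.
        bld := array.NewStringBuilder(mem)
        defer bld.Release()
        for i := 0; i < 1000000; i++ {
            bld.Append("some string value")
        }
        arr := bld.NewStringArray()
        defer arr.Release()

        out, err := os.Create("heap.pprof")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        runtime.GC() // bring the heap profile up to date before writing it
        if err := pprof.WriteHeapProfile(out); err != nil {
            panic(err)
        }
        // Inspect with: go tool pprof heap.pprof, then `top` as in the listing above.
    }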
> > > > >
> > > > >
> > > > > I'm a bit busy at the moment but I'll probably repeat the same test
> > on
> > > > the
> > > > > other Arrow implementations (e.g. Java) to see if they allocate a
> > similar
> > > > > amount.
> >
>
> I've implemented chunking over there:
>
> - https://github.com/apache/arrow/pull/3019
>
> could you try with a couple of chunking values?
> e.g.:
> - csv.WithChunk(-1): reads the whole file into memory, creates one big
> record
> - csv.WithChunk(nrows/10): creates 10 records
>
> also, it would be great to try to disentangle the memory usage of the "CSV
> reading part" from the "Table creation" one:
> - have some perf numbers w/o storing all these Records into a []Record
> slice,
> - have some perf numbers w/ only storing these Records into a []Record
> slice,
> - have some perf numbers w/ storing the records into the slice + creating
> the Table.
>
> hth,
> -s
>
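For concreteness, a minimal sketch of the measurements suggested above, using
the csv.WithChunk option from the Go arrow/csv reader; the file name, schema
and column names are made up for illustration, and the Table step assumes
array.NewTableFromRecords. Dropping the Retain/append step gives the "CSV
reading only" number, keeping it but skipping the Table gives the "[]Record
only" number, and the full program gives the "slice + Table" number.

    package main

    import (
        "fmt"
        "os"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/csv"
    )

    func main() {
        f, err := os.Open("concurrent_streams.csv")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Hypothetical schema -- the real file has its own columns.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "stream_id", Type: arrow.BinaryTypes.String},
            {Name: "count", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        // csv.WithChunk(n) controls how many rows end up in each Record;
        // csv.WithChunk(-1) reads the whole file into a single Record.
        rdr := csv.NewReader(f, schema, csv.WithChunk(100000))
        defer rdr.Release()

        var recs []array.Record
        for rdr.Next() {
            rec := rdr.Record()
            rec.Retain() // the Record is only valid until the next call to Next
            recs = append(recs, rec)
        }
        if err := rdr.Err(); err != nil {
            panic(err)
        }

        tbl := array.NewTableFromRecords(schema, recs)
        defer tbl.Release()
        for _, rec := range recs {
            rec.Release() // the Table keeps its own references to the columns
        }

        fmt.Println("rows:", tbl.NumRows())
    }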

Re: [Go] High memory usage on CSV read into table

Posted by Sebastien Binet <bi...@cern.ch>.
On Tue, Dec 4, 2018 at 10:23 PM Daniel Harper <dj...@gmail.com> wrote:

> Sorry I've been away at re:Invent.
>
> Just tried out what's currently on master (with the chunked change that
> looks like it has been merged). I'll do the breakdown of the different parts
> later, but as a high-level look at just running the same script as described
> above, these are the numbers:
>
>
> https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing
>


>
> Looks to me like the change has definitely helped, with memory usage
> dropping to around 300 MB, although the usage doesn't really change that
> much once the chunk size is > 1000.
>

good. you might want to try with a chunk size of -1 (this loads the whole
CSV file into memory in one fell swoop).

also, there's this PR which should probably also reduce the memory pressure:
- https://github.com/apache/arrow/pull/3073

cheers,
-s
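
To compare chunk sizes (including -1) and to separate the "CSV reading",
"[]Record" and "Table" stages, a small helper along these lines can be
dropped into the test script; the stage names are illustrative and the real
calls go where the placeholder comment sits.

    package main

    import (
        "fmt"
        "runtime"
    )

    // reportHeap forces a GC and prints the live heap, so the stages of the
    // CSV -> []Record -> Table pipeline can be compared directly.
    func reportHeap(stage string) {
        runtime.GC()
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        fmt.Printf("%-15s heap_alloc=%6.1f MiB heap_inuse=%6.1f MiB\n",
            stage, float64(ms.HeapAlloc)/(1<<20), float64(ms.HeapInuse)/(1<<20))
    }

    func main() {
        reportHeap("start")
        // ... read the CSV (reportHeap("after read")), store the Records
        // (reportHeap("after records")), build the Table
        // (reportHeap("after table")) ...
    }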

