Posted to commits@arrow.apache.org by uw...@apache.org on 2018/12/05 18:06:52 UTC

[arrow] branch master updated: ARROW-3929: [Go] improve CSV reader memory usage

This is an automated email from the ASF dual-hosted git repository.

uwe pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 0d0ff75  ARROW-3929: [Go] improve CSV reader memory usage
0d0ff75 is described below

commit 0d0ff7521b2212800a9aeec0c945bf3f4a402f68
Author: Sebastien Binet <bi...@cern.ch>
AuthorDate: Wed Dec 5 19:06:43 2018 +0100

    ARROW-3929: [Go] improve CSV reader memory usage
    
    This CL sets ReuseRecord on the underlying encoding/csv reader so that
    the memory backing each record is reused from row to row, which reduces
    memory pressure on Go's GC.
    
    ```
    $> benchstat old.txt new.txt
    name                                    old time/op    new time/op    delta
    Read/rows=10_cols=1_chunks=10-8           39.4µs ±18%    42.6µs ±24%     ~     (p=0.218 n=10+10)
    Read/rows=10_cols=10_chunks=10-8           293µs ±23%     280µs ±24%     ~     (p=0.400 n=10+9)
    Read/rows=10_cols=100_chunks=10-8         2.72ms ±24%    2.56ms ±20%     ~     (p=0.353 n=10+10)
    Read/rows=10_cols=1000_chunks=10-8        24.3ms ± 2%    24.0ms ± 3%     ~     (p=0.059 n=8+9)
    Read/rows=100_cols=1_chunks=10-8          74.9µs ±11%    62.1µs ±19%  -17.21%  (p=0.004 n=10+10)
    Read/rows=100_cols=10_chunks=10-8          559µs ±21%     474µs ±21%  -15.12%  (p=0.009 n=10+10)
    Read/rows=100_cols=100_chunks=10-8        5.53ms ±21%    4.36ms ±16%  -21.27%  (p=0.000 n=10+9)
    Read/rows=100_cols=1000_chunks=10-8       41.9ms ± 3%    42.2ms ±13%     ~     (p=0.684 n=10+10)
    Read/rows=1000_cols=1_chunks=10-8          421µs ±13%     320µs ±10%  -23.98%  (p=0.000 n=10+10)
    Read/rows=1000_cols=10_chunks=10-8        3.24ms ±24%    2.63ms ±15%  -18.77%  (p=0.007 n=10+10)
    Read/rows=1000_cols=100_chunks=10-8       33.0ms ±17%    27.0ms ±19%  -18.09%  (p=0.001 n=10+10)
    Read/rows=1000_cols=1000_chunks=10-8       219ms ± 1%     211ms ± 2%   -3.81%  (p=0.000 n=9+10)
    Read/rows=10000_cols=1_chunks=10-8        3.66ms ±11%    2.91ms ±10%  -20.27%  (p=0.000 n=10+10)
    Read/rows=10000_cols=10_chunks=10-8       31.8ms ±16%    25.6ms ±15%  -19.66%  (p=0.000 n=10+10)
    Read/rows=10000_cols=100_chunks=10-8       192ms ± 1%     182ms ± 1%   -5.19%  (p=0.000 n=10+10)
    Read/rows=10000_cols=1000_chunks=10-8      1.99s ± 1%     1.93s ± 2%   -3.26%  (p=0.000 n=9+9)
    Read/rows=100000_cols=1_chunks=10-8       32.9ms ± 4%    26.1ms ± 4%  -20.75%  (p=0.000 n=10+10)
    Read/rows=100000_cols=10_chunks=10-8       203ms ± 1%     198ms ± 7%     ~     (p=0.123 n=10+10)
    Read/rows=100000_cols=100_chunks=10-8      2.00s ± 1%     1.92s ± 1%   -4.24%  (p=0.000 n=10+8)
    Read/rows=100000_cols=1000_chunks=10-8     22.7s ± 2%     22.0s ± 2%   -3.31%  (p=0.000 n=9+10)
    
    name                                    old alloc/op   new alloc/op   delta
    Read/rows=10_cols=1_chunks=10-8           32.7kB ± 0%    32.2kB ± 0%   -1.32%  (p=0.000 n=10+10)
    Read/rows=10_cols=10_chunks=10-8           281kB ± 0%     277kB ± 0%   -1.54%  (p=0.000 n=10+10)
    Read/rows=10_cols=100_chunks=10-8         2.77MB ± 0%    2.73MB ± 0%   -1.58%  (p=0.000 n=10+10)
    Read/rows=10_cols=1000_chunks=10-8        27.8MB ± 0%    27.3MB ± 0%   -1.59%  (p=0.000 n=9+9)
    Read/rows=100_cols=1_chunks=10-8          44.0kB ± 0%    39.3kB ± 0%  -10.80%  (p=0.000 n=10+10)
    Read/rows=100_cols=10_chunks=10-8          381kB ± 0%     333kB ± 0%  -12.48%  (p=0.000 n=10+10)
    Read/rows=100_cols=100_chunks=10-8        3.78MB ± 0%    3.29MB ± 0%  -12.75%  (p=0.000 n=10+10)
    Read/rows=100_cols=1000_chunks=10-8       37.9MB ± 0%    33.1MB ± 0%  -12.83%  (p=0.000 n=10+9)
    Read/rows=1000_cols=1_chunks=10-8          200kB ± 0%     152kB ± 0%  -23.99%  (p=0.000 n=10+10)
    Read/rows=1000_cols=10_chunks=10-8        1.84MB ± 0%    1.36MB ± 0%  -26.08%  (p=0.000 n=10+9)
    Read/rows=1000_cols=100_chunks=10-8       18.4MB ± 0%    13.5MB ± 0%  -26.44%  (p=0.000 n=9+10)
    Read/rows=1000_cols=1000_chunks=10-8       184MB ± 0%     135MB ± 0%  -26.62%  (p=0.000 n=10+10)
    Read/rows=10000_cols=1_chunks=10-8        1.65MB ± 0%    1.17MB ± 0%  -29.02%  (p=0.000 n=10+10)
    Read/rows=10000_cols=10_chunks=10-8       15.7MB ± 0%    10.9MB ± 0%  -30.65%  (p=0.000 n=10+10)
    Read/rows=10000_cols=100_chunks=10-8       156MB ± 0%     108MB ± 0%  -31.12%  (p=0.000 n=10+8)
    Read/rows=10000_cols=1000_chunks=10-8     1.58GB ± 0%    1.09GB ± 0%  -31.06%  (p=0.000 n=10+10)
    Read/rows=100000_cols=1_chunks=10-8       20.1MB ± 0%    15.3MB ± 0%  -23.93%  (p=0.000 n=10+9)
    Read/rows=100000_cols=10_chunks=10-8       197MB ± 0%     149MB ± 0%  -24.39%  (p=0.000 n=10+8)
    Read/rows=100000_cols=100_chunks=10-8     1.96GB ± 0%    1.47GB ± 0%  -24.86%  (p=0.000 n=10+10)
    Read/rows=100000_cols=1000_chunks=10-8    19.7GB ± 0%    14.7GB ± 0%  -25.00%  (p=0.000 n=10+10)
    
    name                                    old allocs/op  new allocs/op  delta
    Read/rows=10_cols=1_chunks=10-8              319 ± 0%       310 ± 0%   -2.82%  (p=0.000 n=10+10)
    Read/rows=10_cols=10_chunks=10-8           2.63k ± 0%     2.62k ± 0%   -0.34%  (p=0.000 n=10+10)
    Read/rows=10_cols=100_chunks=10-8          25.7k ± 0%     25.7k ± 0%   -0.04%  (p=0.000 n=10+10)
    Read/rows=10_cols=1000_chunks=10-8          256k ± 0%      256k ± 0%   -0.00%  (p=0.000 n=10+10)
    Read/rows=100_cols=1_chunks=10-8             524 ± 0%       425 ± 0%  -18.89%  (p=0.000 n=10+10)
    Read/rows=100_cols=10_chunks=10-8          3.02k ± 0%     2.92k ± 0%   -3.27%  (p=0.000 n=10+10)
    Read/rows=100_cols=100_chunks=10-8         28.0k ± 0%     27.9k ± 0%   -0.35%  (p=0.000 n=10+10)
    Read/rows=100_cols=1000_chunks=10-8         277k ± 0%      277k ± 0%   -0.04%  (p=0.000 n=10+10)
    Read/rows=1000_cols=1_chunks=10-8          2.43k ± 0%     1.44k ± 0%  -41.04%  (p=0.000 n=10+10)
    Read/rows=1000_cols=10_chunks=10-8         5.92k ± 0%     4.92k ± 0%  -16.87%  (p=0.000 n=10+10)
    Read/rows=1000_cols=100_chunks=10-8        40.8k ± 0%     39.8k ± 0%   -2.45%  (p=0.000 n=10+10)
    Read/rows=1000_cols=1000_chunks=10-8        389k ± 0%      388k ± 0%   -0.26%  (p=0.000 n=10+10)
    Read/rows=10000_cols=1_chunks=10-8         20.6k ± 0%     10.6k ± 0%  -48.58%  (p=0.000 n=10+10)
    Read/rows=10000_cols=10_chunks=10-8        25.4k ± 0%     15.4k ± 0%  -39.33%  (p=0.000 n=10+10)
    Read/rows=10000_cols=100_chunks=10-8       73.8k ± 0%     63.8k ± 0%  -13.56%  (p=0.000 n=10+10)
    Read/rows=10000_cols=1000_chunks=10-8       557k ± 0%      547k ± 0%   -1.79%  (p=0.000 n=10+10)
    Read/rows=100000_cols=1_chunks=10-8         201k ± 0%      101k ± 0%  -49.78%  (p=0.000 n=10+10)
    Read/rows=100000_cols=10_chunks=10-8        208k ± 0%      108k ± 0%  -48.02%  (p=0.000 n=10+10)
    Read/rows=100000_cols=100_chunks=10-8       282k ± 0%      182k ± 0%  -35.49%  (p=0.000 n=10+10)
    Read/rows=100000_cols=1000_chunks=10-8     1.02M ± 0%     0.92M ± 0%   -9.83%  (p=0.000 n=10+10)
    ```
    
    Author: Sebastien Binet <bi...@cern.ch>
    
    Closes #3073 from sbinet/issue-3929 and squashes the following commits:
    
    67a3272c <Sebastien Binet> ARROW-3929:  improve CSV reader memory usage
    8eb60c52 <Sebastien Binet> ARROW-3681:  Add benchmarks for CSV reader
---
 go/arrow/csv/csv.go | 1 +
 1 file changed, 1 insertion(+)

diff --git a/go/arrow/csv/csv.go b/go/arrow/csv/csv.go
index 79f2280..022c46d 100644
--- a/go/arrow/csv/csv.go
+++ b/go/arrow/csv/csv.go
@@ -98,6 +98,7 @@ func NewReader(r io.Reader, schema *arrow.Schema, opts ...Option) *Reader {
 	validate(schema)
 
 	rr := &Reader{r: csv.NewReader(r), schema: schema, refs: 1, chunk: 1}
+	rr.r.ReuseRecord = true
 	for _, opt := range opts {
 		opt(rr)
 	}
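
For readers unfamiliar with the field toggled above: with ReuseRecord set,
encoding/csv documents that Read may return a slice sharing its backing array
with the slice returned by the previous call, so a caller that wants to keep a
row has to copy it first. The following is a minimal standalone sketch of that
pattern, not code from this commit; the helper name readRows and the sample
input are hypothetical and chosen only for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// readRows is a hypothetical helper (not part of go/arrow/csv). It reads
// every record from src with ReuseRecord enabled and returns copies of the
// rows, because the slice returned by Read may be overwritten by the next
// call to Read.
func readRows(src string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(src))
	r.ReuseRecord = true // same setting the diff above applies to rr.r

	var rows [][]string
	for {
		rec, err := r.Read()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		// Copy the fields before retaining them: rec may share its
		// backing array with the record returned by the next Read.
		row := make([]string, len(rec))
		copy(row, rec)
		rows = append(rows, row)
	}
}

func main() {
	rows, err := readRows("a,b\nc,d\ne,f\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row)
	}
}
```

A consumer that converts each field as soon as Read returns and does not hold
on to the record slice afterwards can skip the copy, which is the situation in
which this setting saves allocations.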