Posted to dev@parquet.apache.org by ALeX Wang <ee...@gmail.com> on 2018/07/30 17:54:53 UTC

Small malloc at file open and metadata parsing

Hi,

I'm reading a Parquet file (generated by the Java Parquet library).  Our schema
has 400 columns (both non-array elements and 1-dimensional array
elements).

I'm using release 1.3.1, gcc 4.8.5, and the Boost static library 1.53.

I compile parquet-cpp with the following cmake options:
```
cmake3    -DCMAKE_BUILD_TYPE=Debug     -DPARQUET_BUILD_EXAMPLES=OFF
 -DPARQUET_BUILD_TESTS=OFF     -DPARQUET_ARROW_LINKAGE="static"
 -DPARQUET_BUILD_SHARED=OFF     -DPARQUET_BOOST_USE_SHARED=OFF .
```

One thing we noticed is that the C++ library performs a lot of small
mallocs while opening the file and reading the metadata, as shown
below:

```
(gdb) where
#0  0x00007fdf40594801 in malloc () from /lib64/libc.so.6
#1  0x00007fdf40e52ecd in operator new(unsigned long) () from
/lib64/libstdc++.so.6
#2  0x0000000000ea16c0 in __gnu_cxx::new_allocator<std::string>::allocate
(this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
#3  0x0000000000e9eabb in std::_Vector_base<std::string,
std::allocator<std::string> >::_M_allocate (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:168
#4  0x0000000000ecf512 in std::vector<std::string,
std::allocator<std::string> >::_M_default_append (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/vector.tcc:549
#5  0x0000000000eca887 in std::vector<std::string,
std::allocator<std::string> >::resize (this=0x33e6930, __new_size=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:667
#6  0x0000000000ebd589 in parquet::format::ColumnMetaData::read
(this=0x33e6908, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
#7  0x0000000000ebf9ed in parquet::format::ColumnChunk::read
(this=0x33e68f0, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
#8  0x0000000000ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0,
iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
#9  0x0000000000ec4e22 in parquet::format::FileMetaData::read
(this=0x3337270, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
#10 0x0000000000e9364d in
parquet::DeserializeThriftMsg<parquet::format::FileMetaData>
(buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at
/opt/parquet-cpp/src/parquet/thrift.h:119
#11 0x0000000000e8fda5 in
parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0,
metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
#12 0x0000000000e8bf4f in parquet::FileMetaData::FileMetaData
(this=0x31a4ca0, metadata=0x7fdf2cace040
"\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
#13 0x0000000000e8bee3 in parquet::FileMetaData::Make
(metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
#14 0x0000000000e87572 in parquet::SerializedFile::ParseMetaData
(this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#15 0x0000000000e858d4 in parquet::ParquetFileReader::Contents::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:247
#16 0x0000000000e85a6f in parquet::ParquetFileReader::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:265
#17 0x0000000000e859ba in parquet::ParquetFileReader::Open
(source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=...,
metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:259
#18 0x0000000000e85df4 in parquet::ParquetFileReader::OpenFile
(path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:287

(gdb) info br
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <MULTIPLE>
        breakpoint already hit 2679 times
        ignore next 2321 hits
```

The breakpoint above is set on `malloc`.

This seems to be the case regardless of the mmap option.

I would really appreciate some pointers on how to avoid this.
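
As a rough alternative to counting breakpoint hits in gdb, the allocations can also be counted in-process by replacing the global `operator new`. The sketch below is only an illustration: `FakeColumnMeta` and `ResizePaths` are hypothetical stand-ins, not parquet-cpp or Thrift types. It mimics the `std::vector<std::string>::resize(3)` call seen in frame #5 of the backtrace, once per column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

// Counter bumped by every call to the replaced global operator new.
// Not thread-safe; good enough for a single-threaded experiment.
static std::size_t g_new_calls = 0;

void* operator new(std::size_t size) {
  ++g_new_calls;
  if (size == 0) size = 1;  // malloc(0) may legally return a null pointer
  void* p = std::malloc(size);
  if (p == nullptr) throw std::bad_alloc();
  return p;
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Hypothetical stand-in for the per-column metadata object; the real
// Thrift-generated ColumnMetaData has many more fields.
struct FakeColumnMeta {
  std::vector<std::string> path_in_schema;
};

// Resizes every column's path vector the way frame #5 of the backtrace does
// and returns how many operator new calls that triggered.
std::size_t ResizePaths(std::vector<FakeColumnMeta>& columns) {
  const std::size_t before = g_new_calls;
  for (FakeColumnMeta& col : columns) {
    col.path_in_schema.resize(3);
    assert(col.path_in_schema.size() == 3);
  }
  return g_new_calls - before;
}
```

Calling `ResizePaths` on a vector of 400 `FakeColumnMeta` objects should report at least one allocation per column, since each `resize` has to obtain storage for its vector. A real metadata object holds several such containers per column chunk, and the count grows with the number of row groups, which would be consistent with the thousands of hits shown by `info br`.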

Thanks,
Alex Wang,

-- 
Alex Wang,
Open vSwitch developer

Re: Small malloc at file open and metadata parsing

Posted by Ivan Sadikov <iv...@gmail.com>.
Sorry to jump in like this, but I was wondering whether parquet-rs can read the
file correctly, or whether the issue also happens there.
Alex, could you give it a go and see if the file and metadata can be read with
parquet-rs (https://github.com/sunchao/parquet-rs; you can run `cargo
install parquet` to install the parquet tools)?


Cheers,

Ivan


Re: Small malloc at file open and metadata parsing

Posted by ALeX Wang <ee...@gmail.com>.
Thanks for the quick reply @Wes,

Too bad, this is causing a lot of delays (due to page fault handling) for
light queries (ones that query only a few rows/columns).

I will try jemalloc and see.

One more question: when I upgrade to 1.4.0 or later code and use the same
cmake options and environment, OpenFile results in a segfault:

```
awake@ev003:/tmp$ cat tmpfile
(gdb) where
#0  0x00007fc542eebc3c in free () from /lib64/libc.so.6
#1  0x0000000000f13cb1 in arrow::DefaultMemoryPool::Free (this=0x16e71e0
<arrow::default_memory_pool()::default_memory_pool_>, buffer=0x7fc52f425040
<Address 0x7fc52f425040 out of bounds>, size=616512)
    at
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/memory_pool.cc:147
#2  0x0000000000f117b6 in arrow::PoolBuffer::~PoolBuffer (this=0x34b5fb8,
__in_chrg=<optimized out>) at
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/buffer.cc:70
#3  0x0000000000e364b7 in
__gnu_cxx::new_allocator<arrow::PoolBuffer>::destroy<arrow::PoolBuffer>
(this=0x34b5fb0, __p=0x34b5fb8) at
/usr/include/c++/4.8.2/ext/new_allocator.h:124
#4  0x0000000000e35e10 in
std::allocator_traits<std::allocator<arrow::PoolBuffer>
>::_S_destroy<arrow::PoolBuffer> (__a=..., __p=0x34b5fb8) at
/usr/include/c++/4.8.2/bits/alloc_traits.h:281
#5  0x0000000000e34ea3 in
std::allocator_traits<std::allocator<arrow::PoolBuffer>
>::destroy<arrow::PoolBuffer> (__a=..., __p=0x34b5fb8) at
/usr/include/c++/4.8.2/bits/alloc_traits.h:405
#6  0x0000000000e33f01 in std::_Sp_counted_ptr_inplace<arrow::PoolBuffer,
std::allocator<arrow::PoolBuffer>, (__gnu_cxx::_Lock_policy)2>::_M_dispose
(this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:407
#7  0x0000000000e27748 in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
(this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#8  0x0000000000e255bb in
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
(this=0x7ffea5fffc88, __in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#9  0x0000000000e23eae in std::__shared_ptr<arrow::PoolBuffer,
(__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffea5fffc80,
__in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#10 0x0000000000e23ec8 in std::shared_ptr<arrow::PoolBuffer>::~shared_ptr
(this=0x7ffea5fffc80, __in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/shared_ptr.h:93
#11 0x0000000000e875a4 in parquet::SerializedFile::ParseMetaData
(this=0x34b5f60) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#12 0x0000000000e858d4 in parquet::ParquetFileReader::Contents::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:247
#13 0x0000000000e85a6f in parquet::ParquetFileReader::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:265
#14 0x0000000000e859ba in parquet::ParquetFileReader::Open
(source=std::shared_ptr (count 2, weak 0) 0x34b5e50, props=...,
metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:259
#15 0x0000000000e85df4 in parquet::ParquetFileReader::OpenFile
(path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:287
```

Is this a known issue?

Thanks,
Alex Wang,




Re: Small malloc at file open and metadata parsing

Posted by Wes McKinney <we...@gmail.com>.
hi Alex,

It looks like the mallocs are coming from Thrift
(parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we
can do much about this. I'm curious whether it's possible to pass a custom
STL allocator to Thrift so that we could use a different allocation
strategy than the default STL allocator.

- Wes
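
To illustrate the custom STL allocator idea in isolation, here is a minimal sketch of a bump ("arena") allocator plugged into `std::vector`. It is not Thrift code: `Arena`, `ArenaAllocator`, `PathVector` and `BuildPaths` are hypothetical names, and whether the Thrift code generator can be told to emit containers with a non-default allocator is not established in this thread. Note also that only the vector's own storage comes from the arena here; the `std::string` elements still use the default allocator for any heap-allocated contents.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Fixed-size bump arena: hands out slices of one large block and releases
// everything at once when the arena is destroyed.
class Arena {
 public:
  explicit Arena(std::size_t capacity)
      : buffer_(static_cast<char*>(::operator new(capacity))),
        capacity_(capacity), used_(0), num_allocations_(0) {}
  ~Arena() { ::operator delete(buffer_); }
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;

  void* Allocate(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0);
    // Round the current offset up to the requested alignment.
    const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
    if (start + bytes > capacity_) throw std::bad_alloc();
    used_ = start + bytes;
    ++num_allocations_;
    return buffer_ + start;
  }

  std::size_t used() const { return used_; }
  std::size_t num_allocations() const { return num_allocations_; }

 private:
  char* buffer_;
  std::size_t capacity_;
  std::size_t used_;
  std::size_t num_allocations_;
};

// Minimal C++11 allocator that forwards to an Arena; deallocate is a no-op.
template <typename T>
class ArenaAllocator {
 public:
  using value_type = T;

  explicit ArenaAllocator(Arena* arena) noexcept : arena_(arena) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U>& other) noexcept
      : arena_(other.arena()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(arena_->Allocate(n * sizeof(T), alignof(T)));
  }
  void deallocate(T*, std::size_t) noexcept {}

  Arena* arena() const noexcept { return arena_; }

 private:
  Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
  return a.arena() == b.arena();
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
  return !(a == b);
}

using PathVector = std::vector<std::string, ArenaAllocator<std::string>>;

// Builds one 3-element path vector per column, all backed by the same arena,
// and returns the total number of string slots created.
std::size_t BuildPaths(Arena* arena, std::size_t num_columns) {
  ArenaAllocator<std::string> alloc(arena);
  std::size_t total = 0;
  for (std::size_t i = 0; i < num_columns; ++i) {
    PathVector path(alloc);
    path.resize(3);
    total += path.size();
  }
  return total;
}
```

With an arena like this, the per-vector storage requests become pointer bumps inside one pre-allocated block instead of separate `malloc` calls. The trade-off is that memory is only reclaimed when the whole arena goes away, which may or may not fit the lifetime of file metadata.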
