Posted to issues@arrow.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2017/11/30 01:53:00 UTC
[jira] [Updated] (ARROW-1873) Segmentation fault when loading total 2GB of parquet files
[ https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated ARROW-1873:
---------------------------
Description:
We are trying to load 100 Parquet files, each around 20MB. Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to list all the files and then load them into a pandas DataFrame through pyarrow.
The schema of the Parquet files looks like this:
{code:java}
root
|-- dateint: integer (nullable = true)
|-- profileid: long (nullable = true)
|-- time: long (nullable = true)
|-- label: double (nullable = true)
|-- weight: double (nullable = true)
|-- features: array (nullable = true)
| |-- element: double (containsNull = true)
{code}
If we only load a couple of them, it works without any issue. However, when loading all 100 of them, we get a segmentation fault (full trace below). FYI, if we flatten {{features: array[double]}} into top-level columns, the file sizes stay around the same and everything works fine too; a rough sketch of that flattening follows.
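For reference, a rough sketch of that flattening, written in PySpark since the schema above looks like Spark's {{printSchema()}} output; {{df}}, {{NUM_FEATURES}}, and the output path are hypothetical names, not taken from our actual job:
{code:python}
# Hypothetical sketch: expand each element of the fixed-length features
# array into its own top-level double column before writing Parquet.
from pyspark.sql import functions as F

NUM_FEATURES = 10  # hypothetical; use the real length of the features array

flat = df.select(
    "dateint", "profileid", "time", "label", "weight",
    *[F.col("features").getItem(i).alias("feature_%d" % i)
      for i in range(NUM_FEATURES)]
)
flat.write.parquet("/home/dbt/data_flat")  # hypothetical output path
{code}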
Is there anything we can try to eliminate this issue? Thanks.
{code}
>>> import glob
>>> import pyarrow.parquet as pq
>>> files = glob.glob("/home/dbt/data/*")
>>> data = pq.ParquetDataset(files).read().to_pandas()
[New Thread 0x7fffe8f84700 (LWP 23769)]
[New Thread 0x7fffe3b93700 (LWP 23770)]
[New Thread 0x7fffe3392700 (LWP 23771)]
[New Thread 0x7fffe2b91700 (LWP 23772)]
[Thread 0x7fffe2b91700 (LWP 23772) exited]
[Thread 0x7fffe3b93700 (LWP 23770) exited]
Thread 4 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe3392700 (LWP 23771)]
0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
(gdb) backtrace
#0 0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#1 0x00007ffff2700b5a in arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#2 0x00007ffff2714985 in arrow::Status arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object**) () from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#3 0x00007ffff2716b92 in arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, long) ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#4 0x00007ffff270a489 in arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int) const ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#5 0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int, int, arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1} ()> >::_M_run() ()
from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
#6 0x00007ffff1e30c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
at /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
#7 0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at pthread_create.c:333
#8 0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
{code}
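The trace shows the crash inside pyarrow's parallel table-to-pandas conversion while it is handling the list-of-double column. One diagnostic we could try, sketched below with only standard {{pyarrow.parquet}} and pandas calls, is to convert each file to pandas separately and concatenate afterwards; that would tell us whether a specific file triggers the crash or only the combined ~2GB conversion does.
{code:python}
# Minimal diagnostic sketch: convert each ~20MB file to pandas on its own,
# then concatenate, instead of converting the combined ~2GB table at once.
import glob

import pandas as pd
import pyarrow.parquet as pq

frames = []
for path in sorted(glob.glob("/home/dbt/data/*")):
    table = pq.read_table(path)       # read one Parquet file
    frames.append(table.to_pandas())  # convert this file in isolation
data = pd.concat(frames, ignore_index=True)
{code}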
> Segmentation fault when loading total 2GB of parquet files
> ----------------------------------------------------------
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: DB Tsai
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)