Posted to user@arrow.apache.org by Chris Nyland <me...@gmail.com> on 2022/01/25 04:18:39 UTC

pyarrow write_dataset Illegal Instruction

Hello,

I was just taking a look at pyarrow in my off hours. I was trying to write
a partitioned dataset based on the birthdays example in the pyarrow
cookbook. However, when I run the script no data is written and an "Illegal
Instruction" message prints to the screen; no exception is raised. I
installed the pyarrow 6.0.1 manylinux x86_64 wheel via pip for Python 3.7
in a virtual environment. I suspect that it would work if I built pyarrow
myself. That doesn't look too terribly difficult, but it is still kind of
a drag, since I was looking to make some quick progress on an off-hours
project.

If anyone has any ideas on what else it could be, I would like to try them
before building the library myself. Also, is this a pretty typical issue to
run into? At work I primarily do Python on Windows and really haven't had
any build issues there since the Python 2.7 days.

Thanks

Chris

Re: pyarrow write_dataset Illegal Instruction

Posted by Chris Nyland <me...@gmail.com>.
Yes, I am following the instructions on arrow.apache.org under the
build-from-source section for Python development. I am pretty sure that I
ran the build commands as indicated in the tutorial, meaning I built with
Parquet on, but I will do a make clean, run it again, and let you know if
I get a different result.

Chris


Re: pyarrow write_dataset Illegal Instruction

Posted by Weston Pace <we...@gmail.com>.
How did you build Arrow C++?  That error most likely means that the
C++ parquet module was not turned on when the C++ library was built.
In [1], an example command to build C++ is:

---

mkdir arrow/cpp/build
pushd arrow/cpp/build

cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_WITH_BZ2=ON \
      -DARROW_WITH_ZLIB=ON \
      -DARROW_WITH_ZSTD=ON \
      -DARROW_WITH_LZ4=ON \
      -DARROW_WITH_SNAPPY=ON \
      -DARROW_WITH_BROTLI=ON \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      ..
make -j4
make install
popd

---

Specifically the -DARROW_PARQUET=ON part tells the build to build the
parquet module.  The Arrow C++ implementation is broken up into a
bunch of small modules.  When we build for pip/conda we normally turn
on a "stock set" of modules so users that get pyarrow from those
sources don't often have to worry about this detail.

Another option is to disable parquet support when you build python.
Similar to C++, the python module is also broken up into smaller
submodules.  I'm guessing you were following our guides and you ran
`export PYARROW_WITH_PARQUET=1` which tells the python build to build
the parquet module.  You could set that to 0 and the python build
would not build the parquet module.  However, given your original plan
was to play with datasets you probably want to build both the parquet
module and the datasets module.  I'd recommend you include
-DARROW_PYTHON=ON, -DARROW_PARQUET=ON, and -DARROW_DATASET=ON in your
cmake build.

When building python you can either run both `export
PYARROW_WITH_PARQUET=1` and `export PYARROW_WITH_DATASET=1` or you can
run the following build command:

    python setup.py build_ext --inplace --with-parquet --with-dataset

The `--with-dataset` flag achieves the same thing as `export
PYARROW_WITH_DATASET=1`.

[1] https://arrow.apache.org/docs/developers/python.html#build-and-test
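
As a rough way to confirm which optional pieces ended up in a finished
build, the sketch below tries to import the relevant Python submodules and
reports which ones load. The helper name probe_modules is made up for
illustration. It assumes only that a pyarrow built without a given module
raises ImportError when that submodule is imported.

```python
import importlib


def probe_modules(names):
    """Map each module name to True if it imports cleanly, else False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. a missing compiled extension.
            results[name] = False
        else:
            results[name] = True
    return results


if __name__ == "__main__":
    wanted = ["pyarrow", "pyarrow.parquet", "pyarrow.dataset"]
    for name, ok in probe_modules(wanted).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

If pyarrow.dataset shows up as missing after a source build, that would
suggest the dataset flags were left off in either the C++ or the Python
step.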


Re: pyarrow write_dataset Illegal Instruction

Posted by Chris Nyland <me...@gmail.com>.
Good guess. Yes, I am working on my beater laptop, which is an old ThinkPad
X200. I started working through the compile instructions and was going
along pretty well until I got to the point of building the Python
extensions. When I run those commands I get a message saying, basically,
that it can't find Parquet. Full output is below.

Any ideas? I did look in the CMakeOutput.log but didn't see anything that
made obvious sense to me.

The result of running

python setup.py build_ext --inplace

running build_ext
-- Running cmake for pyarrow
cmake -DPYTHON_EXECUTABLE=~/build_arrow/pyarrow/bin/python -DPython3_EXECUTABLE=~/build_arrow/pyarrow/bin/python "" -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release ~/build_arrow/arrow/python
-- System processor: x86_64
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Generator: Unix Makefiles
-- Build output directory: ~/build_arrow/arrow/python/build/temp.linux-x86_64-3.7/release
-- Searching for Python libs in ~/build_arrow/pyarrow/lib64;~/build_arrow/pyarrow/lib;/usr/lib/python3.7/config-3.7m-x86_64-linux-gnu
-- Looking for python3.7m
-- Found Python lib /usr/lib/python3.7/config-3.7m-x86_64-linux-gnu/libpython3.7m.so
-- Searching for Python libs in ~/build_arrow/pyarrow/lib64;~/build_arrow/pyarrow/lib;/usr/lib/python3.7/config-3.7m-x86_64-linux-gnu
-- Looking for python3.7m
-- Found Python lib /usr/lib/python3.7/config-3.7m-x86_64-linux-gnu/libpython3.7m.so
-- Arrow version: 7.0.0 (HOME: ~/build_arrow/dist)
-- Arrow SO and ABI version: 700
-- Arrow full SO version: 700.0.0
-- Found the Arrow core shared library: ~/build_arrow/dist/lib/libarrow.so
-- Found the Arrow core import library: ~/build_arrow/dist/lib/libarrow.so
-- Found the Arrow core static library: ~/build_arrow/dist/lib/libarrow.a
-- Found the Arrow Python by HOME: ~/build_arrow/dist
-- Found the Arrow Python shared library: ~/build_arrow/dist/lib/libarrow_python.so
-- Found the Arrow Python import library: ~/build_arrow/dist/lib/libarrow_python.so
-- Found the Arrow Python static library: ~/build_arrow/dist/lib/libarrow_python.a
CMake Error at /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find Parquet (missing: PARQUET_INCLUDE_DIR PARQUET_LIB_DIR
  PARQUET_SO_VERSION)
Call Stack (most recent call first):
  /usr/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake_modules/FindParquet.cmake:115 (find_package_handle_standard_args)
  CMakeLists.txt:447 (find_package)


-- Configuring incomplete, errors occurred!
See also "~/build_arrow/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
error: command '/usr/bin/cmake' failed with exit code 1
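
The log above finds libarrow and libarrow_python under the install prefix
but not Parquet. A quick way to see which Arrow C++ libraries were actually
installed is to list that lib directory. The sketch below is only an
illustration: the helper name find_arrow_libs and the EXPECTED_LIBS table
are made up here, and it assumes the libraries were installed under
$ARROW_HOME/lib with the usual libarrow, libarrow_python, libparquet and
libarrow_dataset file name stems.

```python
import os
from pathlib import Path

# Assumed file name stems for the C++ modules discussed in this thread.
EXPECTED_LIBS = {
    "core": "libarrow",
    "python": "libarrow_python",
    "parquet": "libparquet",
    "dataset": "libarrow_dataset",
}


def find_arrow_libs(lib_dir):
    """Report, per module, whether any matching library file is in lib_dir."""
    lib_dir = Path(lib_dir)
    found = {}
    for module, stem in EXPECTED_LIBS.items():
        # "libarrow.*" matches libarrow.so, libarrow.so.700 and libarrow.a,
        # but not libarrow_python.so, because a dot must follow the stem.
        found[module] = lib_dir.is_dir() and any(lib_dir.glob(stem + ".*"))
    return found


if __name__ == "__main__":
    arrow_home = os.environ.get("ARROW_HOME")
    if arrow_home is None:
        print("ARROW_HOME is not set")
    else:
        for module, present in find_arrow_libs(Path(arrow_home) / "lib").items():
            print(f"{module}: {'found' if present else 'not found'}")
```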


Re: pyarrow write_dataset Illegal Instruction

Posted by Weston Pace <we...@gmail.com>.
Your problem is probably old hardware, specifically an older CPU.  Pip
builds rely on popcnt (which I think came in around SSE4.2, not SSE4.1)

I'm pretty sure you are right that you can compile from source and be ok.
It's a performance / portability tradeoff that has to be made when
packaging prebuilt binaries.
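
One way to check this guess on a Linux x86 machine is to look at the CPU
feature flags the kernel reports. The sketch below parses /proc/cpuinfo
and lists which of a few flags are absent. The function names are made up
for illustration, and the default required set (sse4_2 and popcnt) is an
assumption about what the prebuilt wheels need, not something confirmed in
this thread.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Every core repeats the same line, so the first one is enough.
            return set(value.split())
    return set()


def missing_features(flags, required=("sse4_2", "popcnt")):
    """List the required flags (assumed baseline) that the CPU lacks."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if not cpuinfo.exists():
        print("/proc/cpuinfo not available on this system")
    else:
        missing = missing_features(parse_cpu_flags(cpuinfo.read_text()))
        if missing:
            print("CPU lacks:", ", ".join(missing))
        else:
            print("CPU reports all of the checked flags")
```

A non-empty "CPU lacks" line would be consistent with an illegal
instruction crash from a prebuilt binary, though it does not prove that is
the cause.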
