
Posted to issues@arrow.apache.org by "LinGeLin (Jira)" <ji...@apache.org> on 2022/09/15 14:26:00 UTC

[jira] [Created] (ARROW-17740) [c++][compute] Is there any other way to use Join besides Acero?

LinGeLin created ARROW-17740:
--------------------------------

             Summary: [c++][compute] Is there any other way to use Join besides Acero?
                 Key: ARROW-17740
                 URL: https://issues.apache.org/jira/browse/ARROW-17740
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: LinGeLin
         Attachments: data.zip, test.cpp, v4test.py

Acero performs poorly, and core dumps occur frequently!
 
In my scenario, I read one Parquet file and then several other Parquet files. All of these files share a column with the same name (UUID). I need to join the results of the two reads on UUID, project the result (dropping UUID), and apply some custom filtering. As far as I can tell, Acero is the only way to do a join, but in my tests its performance was very poor and it was very unstable, with frequent core dumps. Is there another way to do this, or at least another way to do a join?
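To make the requested operation concrete, here is a minimal sketch of join (on UUID), project (drop UUID), and filter written as a plain hash join using only the C++ standard library. It is not an Arrow or Acero API. The row types LeftRow, RightRow, and JoinedRow, the function JoinProjectFilter, and the column names label and feature are all hypothetical stand-ins for whatever columns the Parquet files actually contain.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row types standing in for the columns read from each file.
struct LeftRow  { std::string uuid; std::int64_t label; };
struct RightRow { std::string uuid; double feature; };

// Output row: the join key (uuid) has been projected away.
struct JoinedRow { std::int64_t label; double feature; };

// Inner hash join on uuid, followed by projection and a caller-supplied filter.
std::vector<JoinedRow> JoinProjectFilter(
    const std::vector<LeftRow>& left,
    const std::vector<RightRow>& right,
    const std::function<bool(const JoinedRow&)>& keep) {
  // Build phase: index the left side by uuid.
  // A multimap is used so duplicate keys on the left are all matched.
  std::unordered_multimap<std::string, const LeftRow*> index;
  index.reserve(left.size());
  for (const LeftRow& row : left) {
    index.emplace(row.uuid, &row);
  }

  // Probe phase: look up each right-side row, emit the projected row,
  // and keep it only if it passes the filter.
  std::vector<JoinedRow> out;
  for (const RightRow& r : right) {
    auto range = index.equal_range(r.uuid);
    for (auto it = range.first; it != range.second; ++it) {
      JoinedRow joined{it->second->label, r.feature};
      if (keep(joined)) {
        out.push_back(joined);
      }
    }
  }
  return out;
}
```

This works row by row and keeps the whole left side in memory, so it only illustrates the intended semantics (an inner join); it says nothing about how a columnar engine would or should implement it.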
 
My project commit: [link|https://github.com/tensorflow/io/commit/57f373b352ea0181d65e12ac834ed9b2a3b31ef5a] 

tensorflow==2.6.2
You can build tfio as follows:
./configure.sh
bazel build -s  --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 --copt=-mfma --copt=-mavx2 
python setup.py bdist_wheel --data bazel-bin
pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl --force-reinstall --no-deps
 
Run v4test.py to test the dataset.
 
data.zip contains several Parquet files, which are stored on S3 in my scenario.

I have copied some of the relevant code into test.cpp. It only shows the general flow and has not been compiled.

 


