Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/10/05 20:00:00 UTC

[jira] [Commented] (ARROW-17740) [c++][compute]Is there any other way to use Join besides Acero?

    [ https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613157#comment-17613157 ] 

Weston Pace commented on ARROW-17740:
-------------------------------------

{quote}
Looks like there's something even more deadly. The following error is reported even if the column used for the join and ultimately the column to be selected is not of type List.
{quote}

Can you clarify? Is the column part of the join payload or not? If it is not part of the payload at all, then that is a new issue. If it is part of the payload, then I think that is ARROW-17216.

{quote}
I'm giving up on Acero. I have also tested DuckDB and DuckDB also performs poorly when reading many columns. Any other suggestions? Otherwise, I'm gonna have to tank.
{quote}

I am not aware of anyone actively working on performance for the many-columns case.

> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
>                 Key: ARROW-17740
>                 URL: https://issues.apache.org/jira/browse/ARROW-17740
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: LinGeLin
>            Priority: Major
>         Attachments: data.zip, image-2022-09-30-14-32-48-405.png, join_test.zip, test.cpp, test_join.cpp, test_join1.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>  
> In the scenario I'm working on, I read one Parquet file and then several other Parquet files. All of these files share a column named UUID. I need to join the results of the two reads on UUID, project (drop UUID), and filter (some custom filtering). I found that Acero was the only way to do the join, but when I tested it, its performance was very poor and it was very unstable, with frequent core dumps. Is there another way to do a join?
>  
> My project commit: [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow ==2.6.2
> You can build tensorflow-io (tfio) as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl --force-reinstall --no-deps
>  
> Run v4test.py to test the dataset.
>  
> data.zip contains several Parquet files, which are stored on S3 in my scenario.
> I have copied some of the code into test.cpp; it only shows the general flow and has not been compiled.
>  
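
For readers skimming the issue, the join / project / filter flow described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, assuming rows are small in-memory dicts; it does not use Arrow, Acero, or Parquet, says nothing about their performance, and the names hash_join, project, and select are hypothetical helpers that do not come from the reporter's code.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key="UUID"):
    """Inner-join two lists of dict rows on `key` with a hash table."""
    # Build phase: index the right-hand rows by their join key.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    # Probe phase: look up each left-hand row and merge every match.
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


def project(rows, drop=("UUID",)):
    """Return copies of the rows without the columns named in `drop`."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


def select(rows, predicate):
    """Keep only the rows for which `predicate` returns True."""
    return [row for row in rows if predicate(row)]


# Made-up sample data standing in for the two Parquet reads.
left = [{"UUID": "a", "x": 1}, {"UUID": "b", "x": 2}, {"UUID": "c", "x": 3}]
right = [{"UUID": "a", "y": 10}, {"UUID": "b", "y": 20}, {"UUID": "d", "y": 40}]

# Join on UUID, drop UUID, then apply a custom filter.
result = select(project(hash_join(left, right)), lambda r: r["y"] > 10)
```

Only rows whose UUID appears on both sides survive the join, so the UUID column can be dropped afterwards without losing the pairing.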



--
This message was sent by Atlassian Jira
(v8.20.10#820010)