Posted to jira@arrow.apache.org by "Aldrin Montana (Jira)" <ji...@apache.org> on 2022/09/16 13:31:00 UTC

[jira] [Commented] (ARROW-17740) [c++][compute]Is there any other way to use Join besides Acero?

    [ https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605827#comment-17605827 ] 

Aldrin Montana commented on ARROW-17740:
----------------------------------------

Is test.cpp just a snippet?

 
{code:cpp}
      ...
      // auto arrow_status = tr.ReadNext(&batch);
      auto record_batchs_result = tr.ToRecordBatches();
      if (!record_batchs_result.ok()) {
        res = arrow::Status::Invalid("response table ToRecordBatches failed: " +
                                                   record_batchs_result.status().ToString());
        break;
      }
      if (background) {
        next_record_batches_ = record_batchs_result.ValueOrDie();

      } else {
        record_batches_ = record_batchs_result.ValueOrDie();
      }
    } while (0);
    ...
  }
{code}

There is one instance of `record_batches_`, and there's also a member variable `record_batchs_`. It's probably an unrelated error, but it still seems suspicious.
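
To illustrate why two nearly identical member names are worth checking, here is a minimal sketch. The `Reader` struct, its methods, and the string "batches" are hypothetical and not taken from test.cpp; the sketch only assumes that one member is written while the similarly named one is read.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the reader class; names are illustrative only.
struct Reader {
  std::vector<std::string> record_batches_;  // written by Load()
  std::vector<std::string> record_batchs_;   // near-duplicate name, never written

  // Stores two placeholder batches in the correctly spelled member.
  void Load() { record_batches_ = {"batch0", "batch1"}; }

  // Reads the misspelled member, so it always sees an empty vector.
  std::size_t NumBatchesBuggy() const { return record_batchs_.size(); }

  // Reads the member that Load() actually writes.
  std::size_t NumBatches() const { return record_batches_.size(); }
};
```

Because both members have the same type, the compiler accepts either name, so a mix-up like this would only show up at runtime as missing or stale data.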

I also cannot view the project commit. Is it perhaps in a private fork?

> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
>                 Key: ARROW-17740
>                 URL: https://issues.apache.org/jira/browse/ARROW-17740
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: LinGeLin
>            Priority: Major
>         Attachments: data.zip, test.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>  
> In the scenario I'm working on, I read one Parquet file and then several other Parquet files. These files all have the same column name (UUID). I need to join (by UUID), project (remove UUID), and filter (some custom filtering) the results of the two reads. I found that Acero could only be used to do the join, but when I tested it, its performance was very poor and very unstable, and core dumps happened often. Is there another way? Or just another way to do a join?
>  
> my project commit: [link|https://github.com/tensorflow/io/commit/57f373b352ea0181d65e12ac834ed9b2a3b31ef5a] 
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s  --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 --copt=-mfma --copt=-mavx2 
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl --force-reinstall --no-deps
>  
> run v4test.py to test the dataset
>  
> Data.zip contains several parquet files, which are stored on S3 in my scenario.
> I have copied some of the code into test.cpp; it only shows the general flow and has not been compiled.
>  


