Posted to jira@arrow.apache.org by "Aldrin Montana (Jira)" <ji...@apache.org> on 2022/09/16 13:31:00 UTC
[jira] [Commented] (ARROW-17740) [c++][compute]Is there any other way to use Join besides Acero?
[ https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605827#comment-17605827 ]
Aldrin Montana commented on ARROW-17740:
----------------------------------------
Is test.cpp just a snippet?
{code:cpp}
...
// auto arrow_status = tr.ReadNext(&batch);
auto record_batchs_result = tr.ToRecordBatches();
if (!record_batchs_result.ok()) {
res = arrow::Status::Invalid("response table ToRecordBatches failed: " +
record_batchs_result.status().ToString());
break;
}
if (background) {
next_record_batches_ = record_batchs_result.ValueOrDie();
} else {
record_batches_ = record_batchs_result.ValueOrDie();
}
} while (0);
...
}
{code}
There is one instance of `record_batches_`, and there is also a member variable `record_batchs_`. It is probably an unrelated error, but it still seems suspicious.
I also cannot view the project commit. Is it perhaps in a private fork?
> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
> Key: ARROW-17740
> URL: https://issues.apache.org/jira/browse/ARROW-17740
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: LinGeLin
> Priority: Major
> Attachments: data.zip, test.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In the scenario I'm working on, I read one Parquet file and then several other Parquet files. These files share a column name (UUID). I need to join (by UUID), project (remove UUID), and filter (some custom filtering) the results of the two reads. I found that Acero was the only way to do the join, but when I tested it, its performance was very poor and it was very unstable, with frequent core dumps. Is there another way to do this, or at least another way to do a join?
>
> My project commit: [link|https://github.com/tensorflow/io/commit/57f373b352ea0181d65e12ac834ed9b2a3b31ef5a]
> tensorflow ==2.6.2
> You can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl --force-reinstall --no-deps
>
> Run v4test.py to test the dataset.
>
> data.zip contains several Parquet files, which are stored on S3 in my scenario.
> I have copied some of the code into test.cpp. It shows only the general flow and has not been compiled.
>
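For readers following the description above, the join (by UUID), project (drop UUID), and filter flow can be sketched as a plain in-memory hash join. This is only an illustration under stated assumptions: it uses the C++ standard library rather than any Arrow or Acero API, and the names `LeftRow`, `RightRow`, `JoinedRow`, and `JoinProjectFilter` are hypothetical, not taken from test.cpp or from Arrow. It says nothing about how Acero implements its join.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row types standing in for rows read from the Parquet files.
struct LeftRow { std::string uuid; int feature; };
struct RightRow { std::string uuid; double score; };

// Output row: the uuid key is projected away after the join.
struct JoinedRow { int feature; double score; };

// Inner hash join on uuid, followed by projection (drop uuid) and a
// caller-supplied filter. All names here are illustrative assumptions.
std::vector<JoinedRow> JoinProjectFilter(
    const std::vector<LeftRow>& left,
    const std::vector<RightRow>& right,
    const std::function<bool(const JoinedRow&)>& keep) {
  // Build phase: index the right side by uuid (duplicate keys are allowed).
  std::unordered_multimap<std::string, const RightRow*> index;
  index.reserve(right.size());
  for (const RightRow& r : right) {
    index.emplace(r.uuid, &r);
  }

  // Probe phase: look up each left row, project, then filter.
  std::vector<JoinedRow> out;
  for (const LeftRow& l : left) {
    auto range = index.equal_range(l.uuid);
    for (auto it = range.first; it != range.second; ++it) {
      JoinedRow row{l.feature, it->second->score};
      if (keep(row)) {
        out.push_back(row);
      }
    }
  }
  return out;
}
```

Calling `JoinProjectFilter(left, right, pred)` with a predicate such as `[](const JoinedRow& r) { return r.score > 0.2; }` returns only the matched rows that pass the filter. The sketch assumes both inputs fit in memory, which may not hold for large Parquet data on S3.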