Posted to user@arrow.apache.org by 1057445597 <10...@qq.com> on 2022/07/26 03:49:06 UTC

how to filter a table to select rows

I use the following code to filter a table, but it always core dumps at scanner_builder->Filter(filter_expression_). Is there a better way to filter a table, or a RecordBatch?


By the way, dataset::ScannerBuilder always core dumps when I use it in tfio to create a TensorFlow dataset. It's most likely buggy.




        // Read file columns and build a table
        std::shared_ptr<::arrow::Table> table;
        CHECK_ARROW(reader->ReadTable(column_indices_, &table));
        // Convert the table to a sequence of batches
        auto tr = std::make_shared<arrow::TableBatchReader>(*table.get());

        // filter
        auto scanner_builder = arrow::dataset::ScannerBuilder::FromRecordBatchReader(tr);
        if (!dataset()->filter_.empty()) {
          std::cout << filter_expression_.ToString() << std::endl;
          scanner_builder->Filter(filter_expression_);
        }






1057445597
1057445597@qq.com




Re: how to filter a table to select rows

Posted by Sasha Krassovsky <kr...@gmail.com>.
Hi 1057445597,
Could you provide more information about your core dump? What backtrace does it give? I notice you’re not checking the Status returned by scanner_builder->Filter. That could be a place to start.

Sasha Krassovsky


> On Jul 26, 2022, at 8:27 AM, Aldrin <ak...@ucsc.edu> wrote:
> 
> You can create an InMemoryDataset from a RecordBatch. See [1] for docs and [2] for example code. You may be able to find something similar for filtering tables.
> 
> [1]: https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset15InMemoryDataset15InMemoryDatasetENSt10shared_ptrI6SchemaEE17RecordBatchVector
> [2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/mainline/src/cpp/processing/operators.cpp#L50
> 
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz


Re: how to filter a table to select rows

Posted by Aldrin <ak...@ucsc.edu>.
You can create an InMemoryDataset from a RecordBatch. See [1] for docs and
[2] for example code. You may be able to find something similar for
filtering tables.

[1]:
https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset15InMemoryDataset15InMemoryDatasetENSt10shared_ptrI6SchemaEE17RecordBatchVector
[2]:
https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/mainline/src/cpp/processing/operators.cpp#L50

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

