Posted to user@arrow.apache.org by 1057445597 <10...@qq.com> on 2022/09/15 03:13:51 UTC

[c++][compute]Is there any other way to use Join besides Acero?

Acero performs poorly, and core dumps occur frequently!

In the scenario I'm working on, I read one Parquet file and then several other Parquet files. These files share a column with the same name (UUID). I need to join the results of the two reads (by UUID), project them (remove UUID), and filter them (some custom filtering). As far as I can tell, Acero is the only way to do a join, but when I tested it, its performance was very poor and it was very unstable, with frequent core dumps. Is there another way? Or at least another way to do a join?

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by 1057445597 <10...@qq.com>.
Thank you very much, your reply is very helpful. I have one more question. Since our data is actually stored in S3, can we set the projection during the scan? My understanding is that we would then fetch only the columns we need from S3 instead of reading the entire file, which would greatly reduce network bandwidth usage. Or have I misunderstood, and even if I project after the scan, only the required columns are read?




On Wed, Sep 21, 2022, 9:01 AM Weston Pace <we...@gmail.com> wrote:

> Thanks for the detailed reproducer.  I've added a few notes on the JIRA
> that I hope will help.

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by Weston Pace <we...@gmail.com>.
Thanks for the detailed reproducer.  I've added a few notes on the JIRA
that I hope will help.

On Tue, Sep 20, 2022, 5:10 AM 1057445597 <10...@qq.com> wrote:

> I re-uploaded a copy of the code that can be compiled and run in
> join_test.zip, including cmakelists.txt, the test data files and the Python
> code that generated the test files. There is also Python code to view the
> data files. You will need to compile Arrow 9.0 yourself.

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by 1057445597 <10...@qq.com>.
I re-uploaded a copy of the code that can be compiled and run in join_test.zip, including cmakelists.txt, the test data files and the Python code that generated the test files. There is also Python code to view the data files. You will need to compile Arrow 9.0 yourself.




On Thu, Sep 15, 2022, 10:27 PM 1057445597 <10...@qq.com> wrote:

> this jira
>
> https://issues.apache.org/jira/browse/ARROW-17740

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by 1057445597 <10...@qq.com>.
Here is the JIRA:

https://issues.apache.org/jira/browse/ARROW-17740


On Thu, Sep 15, 2022, 12:15 PM Weston Pace <we...@gmail.com> wrote:

> Within Arrow-C++ that is the only way I am aware of.  You might be able to
> use DuckDb.  It should be able to scan parquet files.
>
> Is this the same program that you shared before?  Were you able to figure
> out threading?  Can you create a JIRA with some sample input files and a
> reproducible example?

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by Niranda Perera <ni...@gmail.com>.
Hi,
You can give pycylon a try [1]. It has a similar API endpoint in
pycylon.dataframe interface [2].

Best

[1] https://github.com/cylondata/cylon
[2]
https://github.com/cylondata/cylon/blob/main/python/pycylon/examples/dataframe/join.py


On Thu, Sep 15, 2022 at 10:04 AM 1057445597 <10...@qq.com> wrote:

> Is there a same interface in c++?

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by 1057445597 <10...@qq.com>.
Is there an equivalent interface in C++?




On Thu, Sep 15, 2022, 9:47 PM Jacek Pliszka <ja...@gmail.com> wrote:

> Hi!
>
> Why don't you use the Arrow Table join directly?
>
> https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join
>
> Though you need to be careful with join order, as speed may differ
> depending on the order of the joined tables.
>
> BR,
>
> Jacek

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by Jacek Pliszka <ja...@gmail.com>.
Hi!

Why don't you use the Arrow Table join directly?

https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join

Though you need to be careful with join order, as speed may differ
depending on the order of the joined tables.

BR,

Jacek
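
One plausible reason join order can matter, assuming the engine uses a hash join that builds an in-memory hash table from one input and streams the other input past it: building on the smaller table keeps that hash table small. The sketch below is a toy, standard-library-only illustration of that idea and of the join, project, filter sequence from the original question. It is not Arrow's implementation; the function `hash_join`, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Toy inner hash join over lists of dicts (illustrative only).

    Returns (joined_rows, hash_table_entries) so the caller can see how
    many rows had to be held in the hash table.
    """
    # Build phase: index every build-side row by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Probe phase: stream the other input and look up matching rows.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})

    entries = sum(len(rows) for rows in table.values())
    return joined, entries


# Hypothetical data: a small and a large table sharing a "uuid" column.
small = [{"uuid": i, "label": f"s{i}"} for i in range(3)]
large = [{"uuid": i % 5, "value": i} for i in range(1000)]

# Both orders produce the same number of joined rows, but the hash table
# that must be held in memory differs greatly in size.
rows_a, entries_a = hash_join(small, large, "uuid")  # build on the small side
rows_b, entries_b = hash_join(large, small, "uuid")  # build on the large side

# Project (drop the join key), then filter, as in the original question.
projected = [{k: v for k, v in r.items() if k != "uuid"} for r in rows_a]
filtered = [r for r in projected if r["value"] >= 990]
```

Here `entries_a` and `entries_b` show how many rows each ordering keeps in the hash table. Whether a given engine picks the build side from the argument order or chooses it automatically is engine-specific, so it is worth measuring both orders on real data.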


On Thu, Sep 15, 2022 at 06:15 Weston Pace <we...@gmail.com> wrote:

> Within Arrow-C++ that is the only way I am aware of.  You might be able to
> use DuckDb.  It should be able to scan parquet files.
>
> Is this the same program that you shared before?  Were you able to figure
> out threading?  Can you create a JIRA with some sample input files and a
> reproducible example?

Re: [c++][compute]Is there any other way to use Join besides Acero?

Posted by Weston Pace <we...@gmail.com>.
Within Arrow-C++ that is the only way I am aware of.  You might be able to
use DuckDb.  It should be able to scan parquet files.

Is this the same program that you shared before?  Were you able to figure
out threading?  Can you create a JIRA with some sample input files and a
reproducible example?

On Wed, Sep 14, 2022 at 5:14 PM 1057445597 <10...@qq.com> wrote:

> Acero performs poorly, and coredump occurs frequently!
>
> In the scenario I'm working on, I'll read one Parquet file and then
> several other Parquet files. These files will have the same column name
> (UUID). I need to join (by UUID), project (remove UUID), and filter (some
> custom filtering) the results of the two reads. I found that Acero could
> only be used to do join, but when I tested it, Acero performance was very
> poor and very unstable, coredump often happened. Is there another way? Or
> just another way to do a join!