Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/14 01:55:06 UTC
[GitHub] [arrow-datafusion] houqp opened a new pull request #1556: Arrow2 merge
houqp opened a new pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556
# Which issue does this PR close?
Closes the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3
# Rationale for this change
Provide a complete arrow2-based datafusion implementation for full evaluation of the migration. This should give us a good feel for the arrow2 API UX as well as a starting point for performance benchmarks within datafusion and downstream projects.
The goal is to merge the code into an official arrow2 branch in the short run, until we are comfortable making the switch in master.
# What changes are included in this PR?
* Switched to arrow2
* Enabled miri test
Here is a TPCH benchmark I ran on my Linux laptop:
![Screenshot_20220113_174918](https://user-images.githubusercontent.com/670302/149437604-27fca7a1-55e4-48fc-a02a-bc9ee7d2ed74.png)
On average, we are getting around a 5% speedup across the board, with q5 at an 11% gain and q12 at only 1%. If this performance gain can also be replicated in downstream projects, then I think it would make a strong case for us to do the arrow2 switch.
# Are there any user-facing changes?
Yes, downstream consumers of datafusion will need to switch to arrow2 as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785402462
##########
File path: datafusion/Cargo.toml
##########
@@ -39,25 +39,27 @@ path = "src/lib.rs"
[features]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions"]
-simd = ["arrow/simd"]
+# FIXME: https://github.com/jorgecarleitao/arrow2/issues/580
+#simd = ["arrow/simd"]
+simd = []
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"]
regex_expressions = ["regex"]
unicode_expressions = ["unicode-segmentation"]
-pyarrow = ["pyo3", "arrow/pyarrow"]
+# FIXME: add pyarrow support to arrow2 pyarrow = ["pyo3", "arrow/pyarrow"]
+pyarrow = ["pyo3"]
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
force_hash_collisions = []
# Used to enable the avro format
-avro = ["avro-rs", "num-traits"]
+avro = ["arrow/io_avro", "arrow/io_avro_async", "arrow/io_avro_compression", "num-traits", "avro-schema"]
[dependencies]
ahash = { version = "0.7", default-features = false }
hashbrown = { version = "0.11", features = ["raw"] }
-arrow = { version = "6.4.0", features = ["prettyprint"] }
-parquet = { version = "6.4.0", features = ["arrow"] }
+parquet = { package = "parquet2", version = "0.8", default_features = false, features = ["stream"] }
sqlparser = "0.13"
paste = "^1.0"
num_cpus = "1.13.0"
-chrono = { version = "0.4", default-features = false }
+chrono = { version = "0.4", default-features = false, features = ["clock"] }
Review comment:
It looks like datafusion only depends on this to get the current system time in various places. I will file https://github.com/apache/arrow-datafusion/issues/1584 as a follow-up.
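For reference, the system clock can also be read through the standard library alone. The following is a minimal sketch, not DataFusion's code: `unix_timestamp_nanos` is a hypothetical helper, and whether something like it could stand in for the chrono `clock` feature at every call site is an assumption that the follow-up issue would need to confirm.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper (not part of DataFusion): nanoseconds since the
/// Unix epoch, read from the system clock using only the standard library.
fn unix_timestamp_nanos() -> i128 {
    match SystemTime::now().duration_since(UNIX_EPOCH) {
        // The clock is at or after the epoch: the usual case.
        Ok(elapsed) => elapsed.as_nanos() as i128,
        // The clock is set before the epoch: report a negative offset.
        Err(err) => -(err.duration().as_nanos() as i128),
    }
}
```

Calling `unix_timestamp_nanos()` returns a plain integer, so a caller that needs a calendar type would still have to convert it; that conversion is outside the scope of this sketch.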
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017175870
All integration and unit tests are passing now. I believe the MIRI check is failing due to an upstream tokio issue. I will file some follow-up issues tomorrow to track the remaining work needed for us to make the final call on the master merge.
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017175870
All tests are passing now. I believe the MIRI check is failing due to an upstream tokio issue. I will file some follow-up issues tomorrow to track the remaining work needed for us to make the final call on the master merge.
[GitHub] [arrow-datafusion] houqp merged pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp merged pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017537586
Thanks @houqp, epic work!
[GitHub] [arrow-datafusion] ic4y commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
ic4y commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012809108
I found that the peak memory usage of this branch increases by 80% compared to the master branch.
SQL: select avg(user_id) from parquet_event_1 group by user_name limit 10
Test dataset: 450 million total, 50 million users
| branch | peak memory |
| ---- | ---- |
| master(d7e465 and 35d65fc) | 6G |
| arrow2_merge | 10G |
[GitHub] [arrow-datafusion] andygrove edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013730047
Metrics and CPU activity charts for query 1 from master & arrow2.
## Master
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=148.532µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=26.889µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=162.216µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=836.092µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=2.467991ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[fetch_time{inputPartition=0}=121.646159259s, repart_time{inputPartition=0}=1.068665ms, send_time{inputPartition=0}=181.052µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=62.022452324s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=2.333647686s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=18.07630248s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=14.438278954s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/
lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/l
ineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/pa
rt-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=
/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| A | F | 3775127569 | 5660775782243.051 | 5377736101986.644 | 5592847119226.674 | 25.49937018008533 | 38236.11640655464 | 0.05000224292310849 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.03497 | 145999793032.7758 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805398 | 3864590 |
| N | O | 7436302680 | 11150725250531.352 | 10593194904399.148 | 11016931834528.229 | 25.500009351223113 | 38237.22761127162 | 0.049997917972634566 | 291619606 |
| R | F | 3775724821 | 5661602800938.198 | 5378513347920.04 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955949 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
Query 1 iteration 0 took 8165.9 ms
Query 1 avg time: 8165.94 ms
```
![master_](https://user-images.githubusercontent.com/934084/149633988-d3f64166-2b03-4688-a187-da491f4360c1.png)
## Arrow2 PR
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=90.895µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=16.752µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=269.589µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=966.04µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=3.118624ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[send_time{inputPartition=0}=202.529µs, fetch_time{inputPartition=0}=167.702662304s, repart_time{inputPartition=0}=881.476µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=69.277706429s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=1.948225184s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=19.900026641s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=17.25811968s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, predicate_evaluation_errors{filenam
e=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/m
nt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, predicate_
evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{
filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| A | F | 3775127569 | 5660775782243.05 | 5377736101986.642 | 5592847119226.676 | 25.49937018008533 | 38236.11640655463 | 0.05000224292310846 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.0348 | 145999793032.77588 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805397 | 3864590 |
| N | O | 7436302680 | 11150725250531.35 | 10593194904399.143 | 11016931834528.23 | 25.500009351223113 | 38237.22761127161 | 0.04999791797263452 | 291619606 |
| R | F | 3775724821 | 5661602800938.199 | 5378513347920.042 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955945 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
Query 1 iteration 0 took 27109.6 ms
Query 1 avg time: 27109.65 ms
```
![arrow2](https://user-images.githubusercontent.com/934084/149633991-a1615ac0-a1ca-46f0-9b76-f0abb6917d2c.png)
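The two plans report `elapsed_compute` in mixed units (µs, ms, s), which makes them awkward to compare by eye. Below is a minimal sketch of one way to normalize and compare them. Both `parse_metric_secs` and `percent_change` are hypothetical helpers written for this illustration; they are not part of DataFusion, and the set of accepted suffixes is an assumption based only on the values printed above.

```rust
/// Hypothetical helper: parse a duration as printed in the plan metrics
/// (for example "148.532µs", "2.467991ms", "62.022452324s") into seconds.
/// Returns None when the text has no recognized unit suffix.
fn parse_metric_secs(text: &str) -> Option<f64> {
    // Multi-character suffixes are checked before the bare "s" suffix,
    // because "ms", "µs" and "ns" all end in "s" as well.
    let units = [("ns", 1e-9), ("µs", 1e-6), ("ms", 1e-3), ("s", 1.0)];
    for &(suffix, scale) in units.iter() {
        if let Some(number) = text.strip_suffix(suffix) {
            return number.trim().parse::<f64>().ok().map(|v| v * scale);
        }
    }
    None
}

/// Relative change from `baseline` to `candidate`, in percent.
/// A positive result means the candidate took longer.
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

Applied to the partial `HashAggregateExec` above, `percent_change(parse_metric_secs("62.022452324s")?, parse_metric_secs("69.277706429s")?)` comes out at about 11.7, meaning the arrow2 run reported roughly 11.7% more compute time for that operator than master.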
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785407092
##########
File path: .github/workflows/rust.yml
##########
@@ -116,6 +116,7 @@ jobs:
cargo test --no-default-features
cargo run --example csv_sql
cargo run --example parquet_sql
+ #nopass
Review comment:
Not entirely sure. @Igosuki, do you remember why you added these #nopass lines?
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1015128595
Update: all datafusion unit and integration tests are passing now; we are down to a single test failure in datafusion-cli related to the JSON display format.
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1015128595
All datafusion unit and integration tests are passing now; we are down to a single test failure in datafusion-cli related to the JSON display format.
[GitHub] [arrow-datafusion] andygrove commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013730047
Metrics for query 1 from master & arrow2.
## Master
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=148.532µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=26.889µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=162.216µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=836.092µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=2.467991ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[fetch_time{inputPartition=0}=121.646159259s, repart_time{inputPartition=0}=1.068665ms, send_time{inputPartition=0}=181.052µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=62.022452324s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=2.333647686s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=18.07630248s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=14.438278954s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/
lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/l
ineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/pa
rt-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=
/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| A | F | 3775127569 | 5660775782243.051 | 5377736101986.644 | 5592847119226.674 | 25.49937018008533 | 38236.11640655464 | 0.05000224292310849 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.03497 | 145999793032.7758 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805398 | 3864590 |
| N | O | 7436302680 | 11150725250531.352 | 10593194904399.148 | 11016931834528.229 | 25.500009351223113 | 38237.22761127162 | 0.049997917972634566 | 291619606 |
| R | F | 3775724821 | 5661602800938.198 | 5378513347920.04 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955949 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
Query 1 iteration 0 took 8165.9 ms
Query 1 avg time: 8165.94 ms
```
## Arrow2 PR
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=90.895µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=16.752µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=269.589µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=966.04µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=3.118624ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[send_time{inputPartition=0}=202.529µs, fetch_time{inputPartition=0}=167.702662304s, repart_time{inputPartition=0}=881.476µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=69.277706429s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=1.948225184s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=19.900026641s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=17.25811968s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, predicate_evaluation_errors{filenam
e=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/m
nt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, predicate_
evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{
filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| A | F | 3775127569 | 5660775782243.05 | 5377736101986.642 | 5592847119226.676 | 25.49937018008533 | 38236.11640655463 | 0.05000224292310846 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.0348 | 145999793032.77588 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805397 | 3864590 |
| N | O | 7436302680 | 11150725250531.35 | 10593194904399.143 | 11016931834528.23 | 25.500009351223113 | 38237.22761127161 | 0.04999791797263452 | 291619606 |
| R | F | 3775724821 | 5661602800938.199 | 5378513347920.042 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955945 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
Query 1 iteration 0 took 27109.6 ms
Query 1 avg time: 27109.65 ms
```
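To put the two runs side by side, here is a small sketch that computes the wall-clock ratio and the per-operator change in `elapsed_compute` from the figures quoted above. The helper names `slowdown` and `percent_change` are made up for illustration; only the numbers come from the two plans.

```rust
/// Ratio of the candidate's time to the baseline's time (> 1.0 means slower).
fn slowdown(baseline: f64, candidate: f64) -> f64 {
    candidate / baseline
}

/// Relative change in percent, positive when the candidate is larger.
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

fn main() {
    // Average query 1 times reported above, in milliseconds.
    let master_ms = 8165.94;
    let arrow2_ms = 27109.65;
    println!("wall clock: {:.2}x", slowdown(master_ms, arrow2_ms));

    // elapsed_compute in seconds, copied from the plans: (operator, master, arrow2).
    let ops = [
        ("HashAggregateExec (Partial)", 62.022452324, 69.277706429),
        ("CoalesceBatchesExec (after filter)", 18.07630248, 19.900026641),
        ("FilterExec", 14.438278954, 17.25811968),
        ("ProjectionExec (before aggregate)", 2.333647686, 1.948225184),
    ];
    for (name, master, arrow2) in ops {
        println!("{}: {:+.1}%", name, percent_change(master, arrow2));
    }
}
```

On these figures the wall-clock time is roughly 3.3x longer on the arrow2 branch, while none of the listed `elapsed_compute` values grows by anything close to that. This hints that the gap may sit outside those compute metrics, for example in the Parquet scan, which reports no timing here (`fetch_time` in `RepartitionExec` does rise from about 121.6 s to 167.7 s).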
[GitHub] [arrow-datafusion] alamb commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013079492
Thank you @houqp and @Igosuki -- I'll try to take a look at this later today or tomorrow. I will also start the discussion of "what does this mean for arrow-rs" today, which I expect may take some time to reach consensus on.
[GitHub] [arrow-datafusion] ic4y edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
ic4y edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012809108
I found that the peak memory usage of this branch increases by 80% compared to the master branch.
SQL: `select avg(user_id) from parquet_event_1 group by user_name limit 10`
Test dataset: 450 million total, 50 million users
| branch | peak memory |
| ---- | ---- |
| master(d7e465 and 35d65fc) | 6G |
| arrow2_merge | 10G |
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013835311
Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:
- [x] Get the parquet row group filter test to pass
- [x] Restore sql integration test migration. All those sql tests were migrated and passing previously, but those changes got lost when we merged the sql test refactoring from master.
I will keep working on this tomorrow. In the meantime, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmarks to double check the performance fix. It's quite interesting that I initially got the opposite performance test result, even without that file buf fix :P I will dig into what's causing that as well.
[GitHub] [arrow-datafusion] yjshen commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017190215
Wow! Milestone reached! Thanks for driving this and making it happen @houqp 👍
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012678951
Thanks to @Igosuki and @yjshen, we are now down to a single test failure, related to parquet row group pruning. All other 855 tests are passing without error.
[GitHub] [arrow-datafusion] alamb commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013246630
@ic4y, @tustvold mentioned to me in passing that one of the differences in memory usage might be related to parquet2 decoding entire pages at a time rather than reading them out in smaller batches. I am not sure if that matches your observations.
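A rough way to picture the difference described here is the toy model below. It is not the actual reader API of parquet2 or arrow-rs, and every name in it is hypothetical. If a page holds N values, decoding it whole materializes all N at once, while a batched reader only ever buffers `batch_size` values.

```rust
/// Toy stand-in for an encoded page: it only knows how many values it holds.
struct Page {
    num_values: usize,
}

/// Decode the whole page in one go; the buffer holds every value at once.
fn decode_whole_page(page: &Page) -> Vec<i64> {
    (0..page.num_values as i64).collect()
}

/// Decode the page in batches of at most `batch_size` values, handing each
/// batch to `consume`. Returns the largest buffer length that was in use.
fn decode_in_batches(
    page: &Page,
    batch_size: usize,
    mut consume: impl FnMut(&[i64]),
) -> usize {
    assert!(batch_size > 0, "batch_size must be positive");
    // One reusable buffer, never larger than a single batch.
    let mut buf: Vec<i64> = Vec::with_capacity(batch_size.min(page.num_values));
    let mut peak = 0;
    let mut start = 0;
    while start < page.num_values {
        let end = (start + batch_size).min(page.num_values);
        buf.clear();
        buf.extend(start as i64..end as i64);
        peak = peak.max(buf.len());
        consume(&buf);
        start = end;
    }
    peak
}

fn main() {
    let page = Page { num_values: 1_000_000 };

    let whole = decode_whole_page(&page);
    println!("whole page: {} values buffered at once", whole.len());

    let mut sum = 0i64;
    let peak = decode_in_batches(&page, 4096, |batch| sum += batch.iter().sum::<i64>());
    println!("batched: at most {} values buffered (sum = {})", peak, sum);
}
```

Both paths see the same values; only the peak buffer size differs. Whether this accounts for the reported memory gap would still need to be confirmed by profiling the real readers.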
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017176758
Thank you @jorgecarleitao @yjshen and @Igosuki for your hard work on the migration thus far :)
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785303506
##########
File path: .github/workflows/rust.yml
##########
@@ -116,6 +116,7 @@ jobs:
cargo test --no-default-features
cargo run --example csv_sql
cargo run --example parquet_sql
+ #nopass
Review comment:
What does `#nopass` mean?
##########
File path: datafusion/tests/mod.rs
##########
@@ -1,18 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
Review comment:
Deleting this file (non-obviously) prevents the SQL integration tests from running.
When I restored it and then tried to run the SQL integration tests:
```
cd /Users/alamb/Software/arrow-datafusion && CARGO_TARGET_DIR=/Users/alamb/Software/df-target cargo test -p datafusion --test mod
```
It doesn't compile.
```
cd /Users/alamb/Software/arrow-datafusion && CARGO_TARGET_DIR=/Users/alamb/Software/df-target cargo test -p datafusion --test mod
Compiling datafusion v6.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion)
error[E0432]: unresolved import `arrow::util::display`
--> datafusion/tests/sql/mod.rs:23:11
|
23 | util::display::array_value_to_string,
| ^^^^^^^ could not find `display` in `util`
error[E0433]: failed to resolve: could not find `pretty` in `util`
--> datafusion/tests/sql/explain_analyze.rs:45:34
|
45 | let formatted = arrow::util::pretty::pretty_format_batches(&results).unwrap();
| ^^^^^^ could not find `pretty` in `util`
error[E0433]: failed to resolve: could not find `pretty` in `util`
--> datafusion/tests/sql/explain_analyze.rs:551:31
|
551 | let actual = arrow::util::pretty::pretty_format_batches(&actual).unwrap();
| ^^^^^^ could not find `pretty` in `util`
error[E0433]: failed to resolve: could not find `pretty` in `util`
--> datafusion/tests/sql/explain_analyze.rs:557:31
|
557 | let actual = arrow::util::pretty::pretty_format_batches(&actual).unwrap();
| ^^^^^^ could not find `pretty` in `util`
error[E0433]: failed to resolve: could not find `pretty` in `util`
--> datafusion/tests/sql/explain_analyze.rs:763:34
|
763 | let formatted = arrow::util::pretty::pretty_format_batches(&actual).unwrap();
| ^^^^^^ could not find `pretty` in `util`
error[E0433]: failed to resolve: could not find `pretty` in `util`
--> datafusion/tests/sql/explain_analyze.rs:783:34
|
783 | let formatted = arrow::util::pretty::pretty_format_batches(&actual).unwrap();
| ^^^^^^ could not find `pretty` in `util`
error[E0433]: failed to resolve: use of undeclared type `StringArray`
--> datafusion/tests/sql/functions.rs:89:22
|
89 | Arc::new(StringArray::from(vec!["", "a", "aa", "aaa"])),
| ^^^^^^^^^^^ use of undeclared type `StringArray`
....
```
This naming is very confusing -- I'll open a PR / issue to fix it shortly
##########
File path: .github/workflows/rust.yml
##########
@@ -318,8 +320,7 @@ jobs:
run: |
cargo miri setup
cargo clean
- # Ignore MIRI errors until we can get a clean run
- cargo miri test || true
+ cargo miri test
Review comment:
👍
##########
File path: ballista/rust/core/Cargo.toml
##########
@@ -35,23 +35,21 @@ async-trait = "0.1.36"
futures = "0.3"
hashbrown = "0.11"
log = "0.4"
-prost = "0.8"
+prost = "0.9"
serde = {version = "1", features = ["derive"]}
sqlparser = "0.13"
tokio = "1.0"
-tonic = "0.5"
+tonic = "0.6"
uuid = { version = "0.8", features = ["v4"] }
chrono = { version = "0.4", default-features = false }
-# workaround for https://github.com/apache/arrow-datafusion/issues/1498
-# should be able to remove when we update arrow-flight
-quote = "=1.0.10"
-arrow-flight = { version = "6.4.0" }
+arrow-format = { version = "0.3", features = ["flight-data", "flight-service"] }
Review comment:
https://crates.io/crates/arrow-format for anyone else following along
Perhaps that is something else we could consider putting into the official Apache repo over time (to reduce maintenance costs and allow others to help maintain it).
##########
File path: datafusion/src/physical_plan/expressions/binary.rs
##########
@@ -981,249 +859,125 @@ mod tests {
// 4. verify that the resulting expression is of type C
// 5. verify that the results of evaluation are $VEC
macro_rules! test_coercion {
- ($A_ARRAY:ident, $A_TYPE:expr, $A_VEC:expr, $B_ARRAY:ident, $B_TYPE:expr, $B_VEC:expr, $OP:expr, $C_ARRAY:ident, $C_TYPE:expr, $VEC:expr) => {{
+ ($A_ARRAY:ident, $B_ARRAY:ident, $OP:expr, $C_ARRAY:ident) => {{
let schema = Schema::new(vec![
- Field::new("a", $A_TYPE, false),
- Field::new("b", $B_TYPE, false),
+ Field::new("a", $A_ARRAY.data_type().clone(), false),
+ Field::new("b", $B_ARRAY.data_type().clone(), false),
]);
- let a = $A_ARRAY::from($A_VEC);
- let b = $B_ARRAY::from($B_VEC);
-
// verify that we can construct the expression
let expression =
binary(col("a", &schema)?, $OP, col("b", &schema)?, &schema)?;
let batch = RecordBatch::try_new(
Arc::new(schema.clone()),
- vec![Arc::new(a), Arc::new(b)],
+ vec![Arc::new($A_ARRAY), Arc::new($B_ARRAY)],
)?;
// verify that the expression's type is correct
- assert_eq!(expression.data_type(&schema)?, $C_TYPE);
+ assert_eq!(&expression.data_type(&schema)?, $C_ARRAY.data_type());
// compute
let result = expression.evaluate(&batch)?.into_array(batch.num_rows());
// verify that the array's data_type is correct
Review comment:
```suggestion
// verify that the array is equal
```
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785398875
##########
File path: datafusion/Cargo.toml
##########
@@ -74,14 +76,24 @@ regex = { version = "^1.4.3", optional = true }
lazy_static = { version = "^1.4.0" }
smallvec = { version = "1.6", features = ["union"] }
rand = "0.8"
-avro-rs = { version = "0.13", features = ["snappy"], optional = true }
num-traits = { version = "0.2", optional = true }
pyo3 = { version = "0.14", optional = true }
+avro-schema = { version = "0.2", optional = true }
Review comment:
yes, looks like this is required in order to create an arrow2 avro reader.
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785394934
##########
File path: ballista/rust/core/Cargo.toml
##########
@@ -35,23 +35,21 @@ async-trait = "0.1.36"
futures = "0.3"
hashbrown = "0.11"
log = "0.4"
-prost = "0.8"
+prost = "0.9"
serde = {version = "1", features = ["derive"]}
sqlparser = "0.13"
tokio = "1.0"
-tonic = "0.5"
+tonic = "0.6"
uuid = { version = "0.8", features = ["v4"] }
chrono = { version = "0.4", default-features = false }
-# workaround for https://github.com/apache/arrow-datafusion/issues/1498
-# should be able to remove when we update arrow-flight
-quote = "=1.0.10"
-arrow-flight = { version = "6.4.0" }
+arrow-format = { version = "0.3", features = ["flight-data", "flight-service"] }
Review comment:
I agree.
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013898492
Yes, sorry about that; these were simply comments to indicate that these
particular feature tests were not passing.
On Sun, Jan 16, 2022 at 09:52, QP Hou ***@***.***> wrote:
> Thank you everyone for all the reviews and comments so far. @Igosuki
> <https://github.com/Igosuki> and I have addressed most of them. Here are
> the two remaining todo items:
>
> - Get the parquet row group filter test to pass
> - Restore sql integration test migration. All those sql tests were
> migrated and passing previously, but those changes got lost when we merged
> the sql test refactoring from master.
>
> I will keep working on this tomorrow. In the meantime, feel free to send
> PRs to my fork if you are interested in helping. After these two items are
> fixed, I will run another round of benchmarks to double-check the
> performance fix. It's quite interesting that I initially got the opposite
> performance test result even without that file buf fix :P I will dig into
> what's causing that as well.
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1019727038
Quick update on this, I have cleaned up the issues in the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3. The main remaining items are:
* https://github.com/apache/arrow-datafusion/issues/1652
* https://github.com/apache/arrow-datafusion/issues/1656
* https://github.com/apache/arrow-datafusion/issues/1657
I will keep working on issues in the arrow2 milestone whenever I have capacity. If any of you are interested in helping, please feel free to comment on those issues or send PRs to the official arrow2 branch.
[GitHub] [arrow-datafusion] alamb commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013674164
Discussion of arrow-rs / arrow2 is here in case anyone missed it: https://github.com/apache/arrow-rs/issues/1176
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013010833
Just use massif or heaptrack
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013080406
Ok adding BufReader gave 50% perf on parquet. https://github.com/houqp/arrow-datafusion/pull/19/commits/d8a184969bd2a88292158cbc704e0cb959b28ea6
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1014163126
The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at https://github.com/apache/arrow-datafusion/issues/1591. I changed the expected result in this branch to fix the test failure for now. What the predicate pruning logic returns in this branch is more correct than what we have in master, but still wrong. The proper fix is out of scope of arrow2 migration and tracked in #1591.
We are now passing all 856 unit tests. Two integration tests remain to be fixed; both fail because of differences in how arrow2 formats binary arrays.
[GitHub] [arrow-datafusion] yjshen commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1014180045
> The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591.
I shared the same observation in https://github.com/houqp/arrow-datafusion/pull/16, but ignored the test at the time.
[GitHub] [arrow-datafusion] yjshen commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012813388
@ic4y I'm confused with the numbers and "increases by 80% compared to the master branch"
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785403949
##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -33,6 +33,12 @@ use local::LocalFileSystem;
use crate::error::{DataFusionError, Result};
+/// Both Read and Seek
+pub trait ReadSeek: Read + Seek {}
+
+impl<R: Read + Seek> ReadSeek for std::io::BufReader<R> {}
+impl<R: AsRef<[u8]>> ReadSeek for std::io::Cursor<R> {}
Review comment:
Nice tip on the generic approach. @Igosuki is right that this is not needed for `BufReader` and `Cursor`; it's only needed for `std::fs::File`, so I will keep the generic form. Here is the build error I got without it:
```
error[E0277]: the trait bound `std::fs::File: datasource::object_store::ReadSeek` is not satisfied
--> datafusion/src/datasource/object_store/local.rs:82:12
|
82 | Ok(Box::new(File::open(&self.file.path)?))
| -- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `datasource::object_store::ReadSeek` is not implemented for `std::fs::File`
| |
| required by a bound introduced by this call
|
= note: required for the cast to the object type `dyn datasource::object_store::ReadSeek + std::marker::Send + std::marker::Sync`
```
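For anyone following along, here is a minimal std-only sketch of what a generic form could look like, assuming it means a blanket impl over every `Read + Seek` type. `read_tail` is a hypothetical helper added only for illustration; it is not part of the PR.

```rust
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

/// Both Read and Seek
pub trait ReadSeek: Read + Seek {}

// Blanket impl: every type that is Read + Seek is also ReadSeek,
// which covers std::fs::File, BufReader<R>, Cursor<R>, and so on.
impl<T: Read + Seek> ReadSeek for T {}

// Hypothetical helper (not from the PR): reads the last `n` bytes
// through a boxed trait object.
fn read_tail(mut reader: Box<dyn ReadSeek>, n: i64) -> std::io::Result<Vec<u8>> {
    reader.seek(SeekFrom::End(-n))?;
    let mut out = Vec::new();
    reader.read_to_end(&mut out)?;
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let cursor = Cursor::new(b"hello parquet".to_vec());
    let buffered = BufReader::new(Cursor::new(b"hello parquet".to_vec()));
    // Both concrete types coerce to Box<dyn ReadSeek> without per-type impls.
    assert_eq!(read_tail(Box::new(cursor), 7)?, b"parquet".to_vec());
    assert_eq!(read_tail(Box::new(buffered), 7)?, b"parquet".to_vec());
    Ok(())
}
```

With the blanket impl, `Box::new(File::open(path)?)` would coerce to the trait object in the same way, since `std::fs::File` is also `Read + Seek`.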
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785396533
##########
File path: datafusion/src/datasource/file_format/parquet.rs
##########
@@ -238,29 +246,30 @@ fn summarize_min_max(
}
}
}
- _ => {}
+ PhysicalType::FixedLenByteArray(_) => {
+ // type not supported yet
+ }
}
+
+ Ok(())
}
/// Read and parse the schema of the Parquet file at location `path`
fn fetch_schema(object_reader: Arc<dyn ObjectReader>) -> Result<Schema> {
- let obj_reader = ChunkObjectReader(object_reader);
- let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
- let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
- let schema = arrow_reader.get_schema()?;
-
+ let mut reader = object_reader.sync_reader()?;
+ let meta_data = read_metadata(&mut reader)?;
+ let schema = get_schema(&meta_data)?;
Ok(schema)
}
/// Read and parse the statistics of the Parquet file at location `path`
fn fetch_statistics(object_reader: Arc<dyn ObjectReader>) -> Result<Statistics> {
- let obj_reader = ChunkObjectReader(object_reader);
- let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
- let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
- let schema = arrow_reader.get_schema()?;
+ let mut reader = object_reader.sync_reader()?;
Review comment:
IMO this is the culprit of the perf regressions since this does not buffer anything. Filed under #1583
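To illustrate why the missing buffering matters, here is a rough std-only sketch. `CountingReader` and `small_reads` are hypothetical names used only here; the counter stands in for calls that would reach the file handle.

```rust
use std::io::{BufReader, Cursor, Read};

/// Wrapper that counts how many times the inner reader is hit.
/// (Hypothetical helper for illustration; stands in for a file handle.)
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Reads `n` four-byte values one at a time, the way a metadata parser
/// issues many small reads. Returns the number of read calls that
/// reached the underlying source.
fn small_reads(n: usize, buffered: bool) -> usize {
    let data = vec![0u8; n * 4];
    let mut counting = CountingReader { inner: Cursor::new(data), calls: 0 };
    let mut word = [0u8; 4];
    if buffered {
        // BufReader pulls large blocks from the source and serves the
        // small reads from memory.
        let mut reader = BufReader::new(&mut counting);
        for _ in 0..n {
            reader.read_exact(&mut word).unwrap();
        }
    } else {
        for _ in 0..n {
            counting.read_exact(&mut word).unwrap();
        }
    }
    counting.calls
}

fn main() {
    let unbuffered = small_reads(1000, false);
    let buffered = small_reads(1000, true);
    println!("unbuffered: {} calls, buffered: {} calls", unbuffered, buffered);
    assert!(buffered < unbuffered);
}
```

With a real file each of those unbuffered calls is a system call, which is presumably where the time goes.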
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1017460977
Great work @houqp
On Thu, Jan 20, 2022 at 08:27, Yijie Shen ***@***.***> wrote:
> Wow! Milestone reached! Thanks for driving on this and making it happen
> @houqp <https://github.com/houqp> 👍
[GitHub] [arrow-datafusion] alamb commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1015536714
I think we should merge it into the arrow2 branch and keep iterating from there. I suspect the next big chunk of work is the `RecordBatch` removal / adaptation in https://github.com/jorgecarleitao/arrow2/pull/717
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785407309
##########
File path: datafusion/tests/mod.rs
##########
@@ -1,18 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
Review comment:
oh wow, good catch!
[GitHub] [arrow-datafusion] andygrove commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013718590
Here are some quick benchmark results comparing the master branch and this PR, running on a threadripper desktop with 24 cores and an NVMe drive. I ran each query once only.
These are the commands that I ran:
```
cargo build --release
cd benchmarks
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path /mnt/bigdata/tpch/sf100-24part-parquet --format parquet --query 13 --batch-size 4096 --partitions 24
```
![df-arrow2-perf](https://user-images.githubusercontent.com/934084/149630864-d909c05b-a62e-450d-90a0-78b20db70a99.png)
Queries 1 and 6 were both at least 3x slower and query 12 was more than 2x slower.
I used the following commits:
Master: 1c39f5ce865e3e1225b4895196073be560a93e82
Arrow2: e53d165f018a54d47f80ff2a132f83cee363c79c
I will poke around a bit myself this weekend and see if I can debug this based on metrics to try and determine where the performance issues are.
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r784507391
##########
File path: datafusion/Cargo.toml
##########
@@ -74,14 +76,24 @@ regex = { version = "^1.4.3", optional = true }
lazy_static = { version = "^1.4.0" }
smallvec = { version = "1.6", features = ["union"] }
rand = "0.8"
-avro-rs = { version = "0.13", features = ["snappy"], optional = true }
num-traits = { version = "0.2", optional = true }
pyo3 = { version = "0.14", optional = true }
+avro-schema = { version = "0.2", optional = true }
Review comment:
Is this needed?
##########
File path: datafusion/src/logical_plan/dfschema.rs
##########
@@ -536,9 +536,10 @@ mod tests {
fn from_qualified_schema_into_arrow_schema() -> Result<()> {
let schema = DFSchema::try_from_qualified_schema("t1", &test_schema_1())?;
let arrow_schema: Schema = schema.into();
- let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, \
Review comment:
Note how `dict_id` and `dict_is_ordered`, two confusing attributes of `Field`, were removed. `dict_is_ordered` now lives on `DataType::Dictionary` itself, which lets implementations use the flag directly in compute; `dict_id` is now only required when writing to IPC (via a different mechanism).
##########
File path: ballista/rust/core/src/execution_plans/shuffle_writer.rs
##########
@@ -532,6 +541,7 @@ mod tests {
.unwrap();
let num_rows = stats
+ // see https://github.com/jorgecarleitao/arrow2/pull/416 for fix
Review comment:
```suggestion
```
##########
File path: ballista/rust/core/Cargo.toml
##########
@@ -35,23 +35,21 @@ async-trait = "0.1.36"
futures = "0.3"
hashbrown = "0.11"
log = "0.4"
-prost = "0.8"
+prost = "0.9"
serde = {version = "1", features = ["derive"]}
sqlparser = "0.13"
tokio = "1.0"
-tonic = "0.5"
+tonic = "0.6"
uuid = { version = "0.8", features = ["v4"] }
chrono = { version = "0.4", default-features = false }
-# workaround for https://github.com/apache/arrow-datafusion/issues/1498
-# should be able to remove when we update arrow-flight
-quote = "=1.0.10"
-arrow-flight = { version = "6.4.0" }
+arrow-format = { version = "0.3", features = ["flight-data", "flight-service"] }
Review comment:
`arrow-format` is an auxiliary crate that only contains the IPC and flight. This is so that we can more easily follow changes to the Arrow spec there (and/or particular libs we use to derive proto/flat buffers).
##########
File path: datafusion/src/physical_plan/file_format/json.rs
##########
@@ -50,6 +54,43 @@ impl NdJsonExec {
}
}
+// TODO: implement iterator in upstream json::Reader type
+struct JsonBatchReader<R: Read> {
+ reader: R,
+ schema: SchemaRef,
+ batch_size: usize,
+ proj: Option<Vec<String>>,
+}
+
+impl<R: BufRead> Iterator for JsonBatchReader<R> {
+ type Item = ArrowResult<RecordBatch>;
+
+ fn next(&mut self) -> Option<Self::Item> {
+ // json::read::read_rows iterates on the empty vec and reads at most n rows
+ let mut rows = vec![String::default(); self.batch_size];
Review comment:
This can be moved to `JsonBatchReader` and re-used across batches (that is the idea, at least)
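A minimal std-only sketch of that idea, with the row buffers held as a field so the allocations are reused across batches. `LineBatchReader` is a hypothetical stand-in for `JsonBatchReader`, and the arrow2 JSON calls are deliberately left out.

```rust
use std::io::{BufRead, Cursor};

/// Toy batch reader (hypothetical, std-only) that keeps its line buffers
/// as a field, so their allocations are reused across batches.
struct LineBatchReader<R: BufRead> {
    reader: R,
    rows: Vec<String>, // allocated once, sized to the batch size
}

impl<R: BufRead> LineBatchReader<R> {
    fn new(reader: R, batch_size: usize) -> Self {
        Self { reader, rows: vec![String::new(); batch_size] }
    }

    /// Fills the reused buffers and returns how many rows were read.
    fn next_batch(&mut self) -> std::io::Result<usize> {
        let mut n = 0;
        for row in self.rows.iter_mut() {
            row.clear(); // keeps the capacity, drops the old contents
            if self.reader.read_line(row)? == 0 {
                break; // end of input
            }
            n += 1;
        }
        Ok(n)
    }
}

fn main() -> std::io::Result<()> {
    let input = Cursor::new("{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n");
    let mut reader = LineBatchReader::new(input, 2);
    assert_eq!(reader.next_batch()?, 2);
    assert_eq!(reader.rows[0].trim_end(), "{\"a\":1}");
    assert_eq!(reader.next_batch()?, 1);
    assert_eq!(reader.rows[0].trim_end(), "{\"a\":3}");
    assert_eq!(reader.next_batch()?, 0);
    Ok(())
}
```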
##########
File path: datafusion-examples/examples/simple_udaf.rs
##########
@@ -37,11 +37,11 @@ fn create_context() -> Result<ExecutionContext> {
// define data in two partitions
let batch1 = RecordBatch::try_new(
schema.clone(),
- vec![Arc::new(Float32Array::from(vec![2.0, 4.0, 8.0]))],
+ vec![Arc::new(Float32Array::from_values(vec![2.0, 4.0, 8.0]))],
Review comment:
```suggestion
vec![Arc::new(Float32Array::from_slice([2.0, 4.0, 8.0]))],
```
##########
File path: datafusion/Cargo.toml
##########
@@ -39,25 +39,27 @@ path = "src/lib.rs"
[features]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions"]
-simd = ["arrow/simd"]
+# FIXME: https://github.com/jorgecarleitao/arrow2/issues/580
+#simd = ["arrow/simd"]
+simd = []
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"]
regex_expressions = ["regex"]
unicode_expressions = ["unicode-segmentation"]
-pyarrow = ["pyo3", "arrow/pyarrow"]
+# FIXME: add pyarrow support to arrow2 pyarrow = ["pyo3", "arrow/pyarrow"]
+pyarrow = ["pyo3"]
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
force_hash_collisions = []
# Used to enable the avro format
-avro = ["avro-rs", "num-traits"]
+avro = ["arrow/io_avro", "arrow/io_avro_async", "arrow/io_avro_compression", "num-traits", "avro-schema"]
[dependencies]
ahash = { version = "0.7", default-features = false }
hashbrown = { version = "0.11", features = ["raw"] }
-arrow = { version = "6.4.0", features = ["prettyprint"] }
-parquet = { version = "6.4.0", features = ["arrow"] }
+parquet = { package = "parquet2", version = "0.8", default_features = false, features = ["stream"] }
sqlparser = "0.13"
paste = "^1.0"
num_cpus = "1.13.0"
-chrono = { version = "0.4", default-features = false }
+chrono = { version = "0.4", default-features = false, features = ["clock"] }
Review comment:
Note that this feature has a known vulnerability. `arrow-rs` depends on it; `arrow2` does not (casting from utf8 to datetime is consistent with the arrow spec's definitions of timezone-aware timestamps). It seems that datafusion depends on this (due to how postgres casts strings to datetimes?)
##########
File path: datafusion-examples/examples/simple_udaf.rs
##########
@@ -37,11 +37,11 @@ fn create_context() -> Result<ExecutionContext> {
// define data in two partitions
let batch1 = RecordBatch::try_new(
schema.clone(),
- vec![Arc::new(Float32Array::from(vec![2.0, 4.0, 8.0]))],
+ vec![Arc::new(Float32Array::from_values(vec![2.0, 4.0, 8.0]))],
)?;
let batch2 = RecordBatch::try_new(
schema.clone(),
- vec![Arc::new(Float32Array::from(vec![64.0]))],
+ vec![Arc::new(Float32Array::from_values(vec![64.0]))],
Review comment:
```suggestion
vec![Arc::new(Float32Array::from_slice([64.0]))],
```
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -681,8 +666,8 @@ fn build_join_indexes(
match join_type {
JoinType::Inner | JoinType::Semi | JoinType::Anti => {
// Using a buffer builder to avoid slower normal builder
- let mut left_indices = UInt64BufferBuilder::new(0);
- let mut right_indices = UInt32BufferBuilder::new(0);
+ let mut left_indices = Vec::<u64>::new();
Review comment:
this is another major difference: arrow2 is interoperable with `Vec`: `Buffer<T>` implements `From<Vec<T>>`
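As a rough illustration of why that matters, here is a toy model. The `Buffer<T>` below is a stand-in written for this sketch, not arrow2's actual type, and it only assumes that the buffer takes ownership of the `Vec` allocation.

```rust
use std::sync::Arc;

/// Toy immutable buffer (not arrow2's actual `Buffer<T>`): shared ownership
/// of a Vec, so building one from a Vec moves the allocation, no copy.
#[derive(Debug, Clone)]
struct Buffer<T> {
    data: Arc<Vec<T>>,
}

impl<T> From<Vec<T>> for Buffer<T> {
    fn from(v: Vec<T>) -> Self {
        Buffer { data: Arc::new(v) }
    }
}

impl<T> Buffer<T> {
    fn as_slice(&self) -> &[T] {
        &self.data
    }
}

fn main() {
    // Build the indices with plain Vec methods, as in the diff above.
    let mut left_indices = Vec::<u64>::new();
    left_indices.extend([0, 2, 5]);
    let ptr = left_indices.as_ptr();

    let buffer: Buffer<u64> = left_indices.into();
    assert_eq!(buffer.as_slice(), &[0u64, 2, 5][..]);
    // The heap allocation is the same one the Vec owned.
    assert_eq!(buffer.as_slice().as_ptr(), ptr);
}
```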
##########
File path: datafusion/Cargo.toml
##########
@@ -39,25 +39,27 @@ path = "src/lib.rs"
[features]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions"]
-simd = ["arrow/simd"]
+# FIXME: https://github.com/jorgecarleitao/arrow2/issues/580
+#simd = ["arrow/simd"]
+simd = []
Review comment:
```suggestion
simd = ["arrow/simd"]
```
##########
File path: datafusion/src/datasource/file_format/parquet.rs
##########
@@ -342,12 +332,12 @@ mod tests {
use super::*;
use arrow::array::{
- BinaryArray, BooleanArray, Float32Array, Float64Array, Int32Array,
- TimestampNanosecondArray,
+ BinaryArray, BooleanArray, Float32Array, Float64Array, Int32Array, Int64Array,
};
use futures::StreamExt;
#[tokio::test]
+ /// Parquet2 lacks the ability to set batch size for parquet reader
Review comment:
it is by design: parquet's unit of parallel work is the column chunk. Setting a batch size breaks that unit of work since it requires shared state and synchronization across column chunks (longer version here: https://medium.com/@henry.yijieshen/for-csv-or-the-like-we-could-split-file-by-offsets-and-take-special-care-for-those-records-spread-3ff43195f90)
The upside is that arrow2 supports parallel deserialization out of the box, e.g. [this example](https://github.com/jorgecarleitao/arrow2/blob/main/examples/parquet_read_parallel/src/main.rs), which makes it possible to break from the usual one-thread-per-file constraint found in other systems. In Polars we saw a 1.5-2x speedup when doing this (they use rayon, but the principle applies).
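The principle can be sketched with std threads only. Everything below is a hypothetical toy: `decode_chunk` pretends a column chunk is a run of little-endian `i32` values, and no parquet2 or arrow2 API is used.

```rust
use std::thread;

/// Stand-in for decoding one column chunk (hypothetical: here the
/// "encoding" is just little-endian i32 values).
fn decode_chunk(bytes: Vec<u8>) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Decodes every column chunk of a row group on its own thread. No state
/// is shared between threads because each chunk is self-contained.
fn decode_row_group(chunks: Vec<Vec<u8>>) -> Vec<Vec<i32>> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| thread::spawn(move || decode_chunk(chunk)))
        .collect();
    // Joining in spawn order keeps the columns in their original order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let col_a: Vec<u8> = [1i32, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let col_b: Vec<u8> = [10i32, 20, 30].iter().flat_map(|v| v.to_le_bytes()).collect();
    let columns = decode_row_group(vec![col_a, col_b]);
    assert_eq!(columns, vec![vec![1, 2, 3], vec![10, 20, 30]]);
}
```

A batch size that cuts across chunks would force these workers to coordinate on row offsets, which is the shared state mentioned above.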
##########
File path: ballista-examples/src/bin/ballista-sql.rs
##########
@@ -27,7 +27,7 @@ async fn main() -> Result<()> {
.build()?;
let ctx = BallistaContext::remote("localhost", 50050, &config);
- let testdata = datafusion::arrow::util::test_util::arrow_test_data();
Review comment:
this was made private in `arrow2` because all tests in arrow2 are outside `src/`, so that adding tests does not require re-compiling the crate (and makes it easier to follow which changes are to the source and which changes are to the tests)
##########
File path: datafusion/Cargo.toml
##########
@@ -74,14 +76,24 @@ regex = { version = "^1.4.3", optional = true }
lazy_static = { version = "^1.4.0" }
smallvec = { version = "1.6", features = ["union"] }
rand = "0.8"
-avro-rs = { version = "0.13", features = ["snappy"], optional = true }
num-traits = { version = "0.2", optional = true }
pyo3 = { version = "0.14", optional = true }
+avro-schema = { version = "0.2", optional = true }
+
+[dependencies.arrow]
+package = "arrow2"
+version="0.8"
+features = ["io_csv", "io_json", "io_parquet", "io_parquet_compression", "io_ipc", "io_print", "ahash",
+ "compute_merge_sort", "compute_concatenate", "compute_regex_match", "compute_arithmetics",
+ "compute_cast", "compute_partition", "compute_temporal", "compute_take", "compute_aggregate",
+ "compute_comparison", "compute_if_then_else", "compute_nullif", "compute_boolean", "compute_length",
+ "compute_limit", "compute_boolean_kleene", "compute_like", "compute_filter", "compute_window",]
Review comment:
`"compute"` feature is an alias for all `compute_*` features.
##########
File path: datafusion/tests/parquet_pruning.rs
##########
@@ -697,13 +730,17 @@ fn make_timestamp_batch(offset: Duration) -> RecordBatch {
.map(|(i, _)| format!("Row {} + {}", i, offset))
.collect::<Vec<_>>();
- let arr_nanos = TimestampNanosecondArray::from_opt_vec(ts_nanos, None);
- let arr_micros = TimestampMicrosecondArray::from_opt_vec(ts_micros, None);
- let arr_millis = TimestampMillisecondArray::from_opt_vec(ts_millis, None);
- let arr_seconds = TimestampSecondArray::from_opt_vec(ts_seconds, None);
+ let arr_nanos = PrimitiveArray::<i64>::from(ts_nanos)
+ .to(DataType::Timestamp(TimeUnit::Nanosecond, None));
Review comment:
this is an example of why it is so easy to add timezone support in arrow2 - the physical type (`i64`) is separated from the logical type `DataType::Timestamp(TimeUnit::Nanosecond, None)` so that we "attach" logical types to the (physical) array.
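A toy model of that separation, for readers who have not used arrow2. The types below are simplified stand-ins written for this sketch (arrow2's real `PrimitiveArray` is generic and carries a validity bitmap); only the idea of re-labelling via `to` is taken from the diff.

```rust
/// Toy model (not arrow2's real types): the logical type is metadata
/// attached to an array whose physical storage is always i64.
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Second,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
struct PrimitiveArray {
    values: Vec<Option<i64>>,
    data_type: DataType,
}

impl PrimitiveArray {
    /// Starts out with the default logical type for the physical type.
    fn new(values: Vec<Option<i64>>) -> Self {
        Self { values, data_type: DataType::Int64 }
    }

    /// Re-labels the array; the values are moved, never rewritten.
    fn to(self, data_type: DataType) -> Self {
        Self { values: self.values, data_type }
    }
}

fn main() {
    let raw = vec![Some(1_642_000_000), None];
    let naive = PrimitiveArray::new(raw.clone())
        .to(DataType::Timestamp(TimeUnit::Second, None));
    let zoned = PrimitiveArray::new(raw.clone())
        .to(DataType::Timestamp(TimeUnit::Nanosecond, Some("+01:00".to_string())));
    // Same physical values, different logical interpretation.
    assert_eq!(naive.values, raw);
    assert_eq!(zoned.values, raw);
    assert_ne!(naive.data_type, zoned.data_type);
}
```

In this model, adding a timezone is just another value of the logical type; no new array type is needed.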
##########
File path: ballista/rust/core/src/execution_plans/shuffle_writer.rs
##########
@@ -567,6 +577,7 @@ mod tests {
.downcast_ref::<StructArray>()
.unwrap();
let num_rows = stats
+ // see https://github.com/jorgecarleitao/arrow2/pull/416 for fix
Review comment:
```suggestion
```
##########
File path: datafusion-examples/examples/flight_client.rs
##########
@@ -57,23 +55,26 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
// the schema should be the first message returned, else client should error
let flight_data = stream.message().await?.unwrap();
// convert FlightData to a stream
- let schema = Arc::new(Schema::try_from(&flight_data)?);
+ let (schema, ipc_schema) =
+ deserialize_schemas(flight_data.data_body.as_slice()).unwrap();
+ let schema = Arc::new(schema);
println!("Schema: {:?}", schema);
// all the remaining stream messages should be dictionary and record batches
let mut results = vec![];
- let dictionaries_by_field = vec![None; schema.fields().len()];
+ let dictionaries_by_field = HashMap::new();
Review comment:
this is a subtle change for interoperability with the arrow spec - arrow2 implements the complete arrow specification (all official integration tests pass against C++, [diff on the tests](https://github.com/jorgecarleitao/arrow2/blob/main/integration-testing/unskip.patch)).
##########
File path: ballista/rust/core/src/utils.rs
##########
@@ -30,16 +30,17 @@ use crate::memory_stream::MemoryStream;
use crate::serde::scheduler::PartitionStats;
use crate::config::BallistaConfig;
+use arrow::io::ipc::write::WriteOptions;
Review comment:
```suggestion
use datafusion::arrow::io::ipc::write::WriteOptions;
```
for consistency ^_^
##########
File path: datafusion/src/avro_to_arrow/reader.rs
##########
@@ -101,30 +100,49 @@ impl ReaderBuilder {
}
/// Create a new `Reader` from the `ReaderBuilder`
- pub fn build<'a, R>(self, source: R) -> Result<Reader<'a, R>>
+ pub fn build<R>(self, source: R) -> Result<Reader<R>>
where
R: Read + Seek,
{
let mut source = source;
// check if schema should be inferred
- let schema = match self.schema {
- Some(schema) => schema,
- None => Arc::new(super::read_avro_schema_from_reader(&mut source)?),
- };
source.seek(SeekFrom::Start(0))?;
- Reader::try_new(source, schema, self.batch_size, self.projection)
+ let (mut avro_schemas, mut schema, codec, file_marker) =
+ read::read_metadata(&mut source)?;
+ if let Some(proj) = self.projection {
Review comment:
Projections on Avro files are not tested in arrow2. Note that a projection in a row-based format is less appealing, because we still need to read the whole row to extract the requested columns, so the bulk of the cost remains.
A more defensive approach here is to perform the projection after the read (i.e. remove columns from the RecordBatch). Filed upstream: https://github.com/jorgecarleitao/arrow2/issues/764
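A rough sketch of what "projection after read" could look like. The `Batch` type and the `project` helper are hypothetical and simplified for illustration (plain `i64` columns); they are not the DataFusion or arrow2 API.

```rust
/// A hypothetical, simplified record batch: named columns of equal length.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i64>>,
}

/// Keeps only the columns at the indices in `projection`, in the order given.
/// Returns `None` if any index is out of bounds.
fn project(batch: &Batch, projection: &[usize]) -> Option<Batch> {
    let mut names = Vec::with_capacity(projection.len());
    let mut columns = Vec::with_capacity(projection.len());
    for &i in projection {
        names.push(batch.names.get(i)?.clone());
        columns.push(batch.columns.get(i)?.clone());
    }
    Some(Batch { names, columns })
}
```

The full batch is decoded first and the unwanted columns are dropped afterwards, so this trades some extra work for not depending on the reader's projection support.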
##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -33,6 +33,12 @@ use local::LocalFileSystem;
use crate::error::{DataFusionError, Result};
+/// Both Read and Seek
+pub trait ReadSeek: Read + Seek {}
+
+impl<R: Read + Seek> ReadSeek for std::io::BufReader<R> {}
+impl<R: AsRef<[u8]>> ReadSeek for std::io::Cursor<R> {}
Review comment:
Not sure the trait is needed, but
```suggestion
impl<T: Read + Seek> ReadSeek for T {}
```
should generalize for everything.
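A small standalone sketch of the blanket impl, using only the standard library. The `read_from_start` helper is a hypothetical name added here to show that any `Read + Seek` type can be passed as `dyn ReadSeek` without a per-type impl.

```rust
use std::io::{Read, Seek, SeekFrom};

/// Both Read and Seek
pub trait ReadSeek: Read + Seek {}

// Blanket impl: every type that is Read + Seek is automatically ReadSeek.
impl<T: Read + Seek> ReadSeek for T {}

/// Rewinds any `ReadSeek` source and reads it to the end.
/// (Hypothetical helper, for illustration only.)
fn read_from_start(source: &mut dyn ReadSeek) -> std::io::Result<Vec<u8>> {
    source.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    source.read_to_end(&mut buf)?;
    Ok(buf)
}
```

With this, `std::io::Cursor`, `std::io::BufReader<R>` over a seekable `R`, and `std::fs::File` all qualify without listing them one by one.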
##########
File path: datafusion/src/datasource/file_format/csv.rs
##########
@@ -96,18 +97,30 @@ impl FileFormat for CsvFormat {
let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
while let Some(obj_reader) = readers.next().await {
- let mut reader = obj_reader?.sync_reader()?;
- let (schema, records_read) = arrow::csv::reader::infer_reader_schema(
+ let mut reader = csv::read::ReaderBuilder::new()
+ .delimiter(self.delimiter)
+ .has_headers(self.has_header)
+ .from_reader(obj_reader?.sync_reader()?);
+
+ let schema = csv::read::infer_schema(
&mut reader,
- self.delimiter,
Some(records_to_read),
self.has_header,
+ &csv::read::infer,
)?;
- if records_read == 0 {
- continue;
- }
+
+ // if records_read == 0 {
+ // continue;
+ // }
+ // schemas.push(schema.clone());
+ // records_to_read -= records_read;
+ // if records_to_read == 0 {
+ // break;
+ // }
+ //
+ // FIXME: return recods_read from infer_schema
Review comment:
Addressed here: https://github.com/jorgecarleitao/arrow2/pull/765
##########
File path: datafusion/src/physical_plan/projection.rs
##########
@@ -70,16 +70,15 @@ impl ProjectionExec {
e.data_type(&input_schema)?,
e.nullable(&input_schema)?,
);
- field.set_metadata(get_field_metadata(e, &input_schema));
+ if let Some(metadata) = get_field_metadata(e, &input_schema) {
+ field = field.with_metadata(metadata);
Review comment:
```field.metadata = metadata;```
would also work when field is `mut` (since everything is public in `Field`)
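For comparison, a toy sketch of the two styles. The `Field` struct here is a hypothetical stand-in with public members, not arrow2's actual `Field` definition, and `assign_metadata` is a name made up for this example.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a schema field with public members.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

impl Field {
    pub fn new(name: &str, nullable: bool) -> Self {
        Self { name: name.to_string(), nullable, metadata: BTreeMap::new() }
    }

    /// Builder style: consumes the field and returns it with metadata attached.
    pub fn with_metadata(mut self, metadata: BTreeMap<String, String>) -> Self {
        self.metadata = metadata;
        self
    }
}

/// Direct assignment, which requires a mutable binding.
fn assign_metadata(mut field: Field, metadata: BTreeMap<String, String>) -> Field {
    field.metadata = metadata;
    field
}
```

Both produce the same value; the choice is mostly stylistic.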
[GitHub] [arrow-datafusion] Igosuki edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1019463521
> ![arrow2](https://user-images.githubusercontent.com/934084/149633991-a1615ac0-a1ca-46f0-9b76-f0abb6917d2c.png)
@andygrove What tool did you use to get such a smooth CPU chart?
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1014164828
I also noticed my benchmarks were run with data generated from `tpch-gen.sh`, which only produces single-partition CSV files. @andygrove could you share with me how you generated your sf100 dataset?
[GitHub] [arrow-datafusion] ic4y commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
ic4y commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012829112
@yjshen
Sorry, I made a typo; it has been corrected.
What I meant to say is that the same query uses 6 GB of memory on arrow but 10 GB of memory on arrow2.
[GitHub] [arrow-datafusion] andygrove edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013730047
Metrics for query 1 from master & arrow2.
## Master
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=148.532µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=26.889µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=162.216µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=836.092µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=2.467991ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[fetch_time{inputPartition=0}=121.646159259s, repart_time{inputPartition=0}=1.068665ms, send_time{inputPartition=0}=181.052µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=62.022452324s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=2.333647686s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=18.07630248s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=14.438278954s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/
lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/l
ineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/pa
rt-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=
/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
| A | F | 3775127569 | 5660775782243.051 | 5377736101986.644 | 5592847119226.674 | 25.49937018008533 | 38236.11640655464 | 0.05000224292310849 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.03497 | 145999793032.7758 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805398 | 3864590 |
| N | O | 7436302680 | 11150725250531.352 | 10593194904399.148 | 11016931834528.229 | 25.500009351223113 | 38237.22761127162 | 0.049997917972634566 | 291619606 |
| R | F | 3775724821 | 5661602800938.198 | 5378513347920.04 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955949 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+----------------------+-------------+
Query 1 iteration 0 took 8165.9 ms
Query 1 avg time: 8165.94 ms
```
![master_](https://user-images.githubusercontent.com/934084/149633988-d3f64166-2b03-4688-a187-da491f4360c1.png)
## Arrow2 PR
```
=== Physical plan with metrics ===
SortExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST], metrics=[output_rows=4, elapsed_compute=90.895µs]
CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=16.752µs]
ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=269.589µs]
HashAggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=966.04µs]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=96, elapsed_compute=3.118624ms]
RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 24), metrics=[send_time{inputPartition=0}=202.529µs, fetch_time{inputPartition=0}=167.702662304s, repart_time{inputPartition=0}=881.476µs]
HashAggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount Multiply Int64(1) Plus lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=96, elapsed_compute=69.277706429s]
ProjectionExec: expr=[l_extendedprice@1 * CAST(1 AS Float64) - l_discount@2 as BinaryExpr-*BinaryExpr--Column-lineitem.l_discountLiteral1Column-lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=591599326, elapsed_compute=1.948225184s]
CoalesceBatchesExec: target_batch_size=2048, metrics=[output_rows=591599326, elapsed_compute=19.900026641s]
FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=591599326, elapsed_compute=17.25811968s]
ParquetExec: batch_size=4096, limit=None, partitions=[/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet, /mnt/bigdata/tp
ch/sf100-24part-parquet/lineitem/part-4.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet, /mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet], metrics=[row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24p
art-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-0.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-3.parquet}=0, num_predicate_creation_errors=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-2.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-23.parquet}=0, predicate_evaluation_errors{filenam
e=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-1.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-4.parquet}=0, predicate_evaluation_errors{filename=/m
nt/bigdata/tpch/sf100-24part-parquet/lineitem/part-5.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-9.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-21.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-7.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-12.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-8.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-18.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, predicate_
evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-10.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-16.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-17.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-11.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-15.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-6.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-13.parquet}=0, predicate_evaluation_errors{
filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-20.parquet}=0, predicate_evaluation_errors{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-14.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-19.parquet}=0, row_groups_pruned{filename=/mnt/bigdata/tpch/sf100-24part-parquet/lineitem/part-22.parquet}=0]
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
| A | F | 3775127569 | 5660775782243.05 | 5377736101986.642 | 5592847119226.676 | 25.49937018008533 | 38236.11640655463 | 0.05000224292310846 | 148047875 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.0348 | 145999793032.77588 | 25.501556956882876 | 38237.19938880452 | 0.04998528433805397 | 3864590 |
| N | O | 7436302680 | 11150725250531.35 | 10593194904399.143 | 11016931834528.23 | 25.500009351223113 | 38237.22761127161 | 0.04999791797263452 | 291619606 |
| R | F | 3775724821 | 5661602800938.199 | 5378513347920.042 | 5593662028982.476 | 25.500066311082758 | 38236.6972423322 | 0.05000130406955945 | 148067255 |
+--------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-------------+
Query 1 iteration 0 took 27109.6 ms
Query 1 avg time: 27109.65 ms
```
![arrow2](https://user-images.githubusercontent.com/934084/149633991-a1615ac0-a1ca-46f0-9b76-f0abb6917d2c.png)
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1019463521
> Query 1 iteration 0 took 27109.6 ms
> Query 1 avg time: 27109.65 ms
> ```
@andygrove What tool did you use to get such a smooth CPU chart?
[GitHub] [arrow-datafusion] Igosuki commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013038787
@houqp What hardware/setup did you use to run the benchmark? I'm actually getting much worse performance when running TPC-H using parquet.
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013835311
Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:
* Get the parquet row group filter test to pass
* Restore sql integration test migration. All those sql tests were migrated and passing previously, but those changes got lost when we merged the sql test refactoring from master.
I will keep working on this tomorrow. In the meantime, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmarks to double-check the performance fix. It's quite interesting that I initially got the opposite performance result even without that file buf fix :P I will dig into what's causing that as well.
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785396533
##########
File path: datafusion/src/datasource/file_format/parquet.rs
##########
@@ -238,29 +246,30 @@ fn summarize_min_max(
}
}
}
- _ => {}
+ PhysicalType::FixedLenByteArray(_) => {
+ // type not supported yet
+ }
}
+
+ Ok(())
}
/// Read and parse the schema of the Parquet file at location `path`
fn fetch_schema(object_reader: Arc<dyn ObjectReader>) -> Result<Schema> {
- let obj_reader = ChunkObjectReader(object_reader);
- let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
- let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
- let schema = arrow_reader.get_schema()?;
-
+ let mut reader = object_reader.sync_reader()?;
+ let meta_data = read_metadata(&mut reader)?;
+ let schema = get_schema(&meta_data)?;
Ok(schema)
}
/// Read and parse the statistics of the Parquet file at location `path`
fn fetch_statistics(object_reader: Arc<dyn ObjectReader>) -> Result<Statistics> {
- let obj_reader = ChunkObjectReader(object_reader);
- let file_reader = Arc::new(SerializedFileReader::new(obj_reader)?);
- let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
- let schema = arrow_reader.get_schema()?;
+ let mut reader = object_reader.sync_reader()?;
Review comment:
```suggestion
let mut reader = object_reader.sync_reader()?;
```
IMO this is the culprit of the perf regressions, since this does not buffer anything. Filed under #1583
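To illustrate the buffering point, here is a minimal sketch using only the standard library. It assumes the underlying reader is a plain `std::fs::File`; `open_buffered` and `read_tail` are hypothetical helpers for illustration and are not part of DataFusion or arrow2. Reading Parquet metadata involves small reads near the end of the file, and wrapping the file in a `BufReader` lets those reads be served from memory instead of issuing one syscall each.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical helper: open a file behind a `BufReader` so that
/// small reads are served from an in-memory buffer.
fn open_buffered(path: &Path) -> io::Result<BufReader<File>> {
    Ok(BufReader::new(File::open(path)?))
}

/// Hypothetical helper: read the last `n` bytes of any seekable source.
/// This mimics the access pattern of reading a footer at the end of a file.
fn read_tail<R: Read + Seek>(reader: &mut R, n: u64) -> io::Result<Vec<u8>> {
    // Seek to the end to learn the total length.
    let len = reader.seek(SeekFrom::End(0))?;
    // Clamp the start offset at zero when the source is shorter than `n`.
    let start = len.saturating_sub(n);
    reader.seek(SeekFrom::Start(start))?;
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}
```

Because `read_tail` is generic over `Read + Seek`, the same call works with a bare `File`, a `BufReader<File>`, or an in-memory `Cursor`; only the number of underlying reads differs.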
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013835311
Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:
- [ ] Get the parquet row group filter test to pass
- [x] Restore sql integration test migration. All those sql tests were migrated and passing previously, but those changes got lost when we merged the sql test refactoring from master.
I will keep working on this tomorrow. In the meantime, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmarks to double-check the performance fix. It's quite interesting that I initially got the opposite performance result even without that file buf fix :P I will dig into what's causing that as well.
[GitHub] [arrow-datafusion] houqp commented on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1016085515
Oops, looks like the arrow2 branch got updated with the latest commits from master. Does anyone mind if I revert it back to https://github.com/apache/arrow-datafusion/commit/2008b1dc06d5030f572634c7f8f2ba48562fa636 and handle the master catch-up in a follow-up PR?
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785403551
##########
File path: datafusion/src/avro_to_arrow/reader.rs
##########
@@ -101,30 +100,49 @@ impl ReaderBuilder {
}
/// Create a new `Reader` from the `ReaderBuilder`
- pub fn build<'a, R>(self, source: R) -> Result<Reader<'a, R>>
+ pub fn build<R>(self, source: R) -> Result<Reader<R>>
where
R: Read + Seek,
{
let mut source = source;
// check if schema should be inferred
- let schema = match self.schema {
- Some(schema) => schema,
- None => Arc::new(super::read_avro_schema_from_reader(&mut source)?),
- };
source.seek(SeekFrom::Start(0))?;
- Reader::try_new(source, schema, self.batch_size, self.projection)
+ let (mut avro_schemas, mut schema, codec, file_marker) =
+ read::read_metadata(&mut source)?;
+ if let Some(proj) = self.projection {
Review comment:
I might be missing something here, but if this is properly supported by arrow2, we don't need to play defense and do the post-read projection, right? I agree with @Igosuki that there is still a benefit in saving compute and memory by doing this at parsing time.
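The trade-off can be sketched without any arrow2 types. The example below is a simplified illustration under stated assumptions: the projection is given as column names, every column is an integer encoded as text, and `projection_mask` and `decode_projected` are hypothetical names, not arrow2 or DataFusion APIs. The whole row has already been read (so no I/O is saved), but values in skipped columns are never parsed or stored.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: build a keep/skip mask over the file's columns.
/// `field_names` is the schema order; `projection` lists the wanted columns.
fn projection_mask(field_names: &[&str], projection: &[&str]) -> Vec<bool> {
    field_names
        .iter()
        .map(|name| projection.contains(name))
        .collect()
}

/// Hypothetical helper: decode only the columns selected by `mask`.
/// Skipped values are never parsed, which is where compute and memory are saved.
fn decode_projected(raw_row: &[&str], mask: &[bool]) -> Result<Vec<i64>, ParseIntError> {
    raw_row
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.parse::<i64>())
        .collect()
}
```

Note that in this sketch the output follows schema order, not the order in which the projection names were listed; a real reader may need to reorder columns afterwards.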
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r785403949
##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -33,6 +33,12 @@ use local::LocalFileSystem;
use crate::error::{DataFusionError, Result};
+/// Both Read and Seek
+pub trait ReadSeek: Read + Seek {}
+
+impl<R: Read + Seek> ReadSeek for std::io::BufReader<R> {}
+impl<R: AsRef<[u8]>> ReadSeek for std::io::Cursor<R> {}
Review comment:
Nice tip on the generic approach. @Igosuki is right that this is not needed for `BufReader` and `Cursor`; it's only needed for `std::fs::File`, so I will keep the generic form.
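For reference, one way the generic form could look is a single blanket impl, sketched below with the standard library only. This is an assumption about the shape of the change, not the code that was merged; `first_byte_after`, `is_read_seek`, and `static_checks` are hypothetical names added for illustration. The combined trait is still useful because Rust does not allow `dyn Read + Seek` as a trait object.

```rust
use std::fs::File;
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

/// Both Read and Seek
pub trait ReadSeek: Read + Seek {}

// Blanket impl: every type that is `Read + Seek` is automatically `ReadSeek`,
// so no per-type impls for `File`, `BufReader`, or `Cursor` are required.
impl<T: Read + Seek> ReadSeek for T {}

/// Hypothetical consumer that accepts any reader as a trait object.
fn first_byte_after(reader: &mut dyn ReadSeek, offset: u64) -> std::io::Result<u8> {
    reader.seek(SeekFrom::Start(offset))?;
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?;
    Ok(byte[0])
}

/// Compile-time check that a type satisfies `ReadSeek`.
fn is_read_seek<T: ReadSeek>() {}

#[allow(dead_code)]
fn static_checks() {
    is_read_seek::<File>();
    is_read_seek::<BufReader<File>>();
    is_read_seek::<Cursor<Vec<u8>>>();
}
```

With the blanket impl in place, a `&mut Cursor<Vec<u8>>` or `&mut BufReader<File>` coerces to `&mut dyn ReadSeek` at the call site without any extra code.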
[GitHub] [arrow-datafusion] Igosuki commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r784746803
##########
File path: datafusion/src/avro_to_arrow/reader.rs
##########
@@ -101,30 +100,49 @@ impl ReaderBuilder {
}
/// Create a new `Reader` from the `ReaderBuilder`
- pub fn build<'a, R>(self, source: R) -> Result<Reader<'a, R>>
+ pub fn build<R>(self, source: R) -> Result<Reader<R>>
where
R: Read + Seek,
{
let mut source = source;
// check if schema should be inferred
- let schema = match self.schema {
- Some(schema) => schema,
- None => Arc::new(super::read_avro_schema_from_reader(&mut source)?),
- };
source.seek(SeekFrom::Start(0))?;
- Reader::try_new(source, schema, self.batch_size, self.projection)
+ let (mut avro_schemas, mut schema, codec, file_marker) =
+ read::read_metadata(&mut source)?;
+ if let Some(proj) = self.projection {
Review comment:
It can save a bit of computation and memory, but obviously not I/O. Should we change this?
[GitHub] [arrow-datafusion] Igosuki commented on a change in pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki commented on a change in pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#discussion_r784748048
##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -33,6 +33,12 @@ use local::LocalFileSystem;
use crate::error::{DataFusionError, Result};
+/// Both Read and Seek
+pub trait ReadSeek: Read + Seek {}
+
+impl<R: Read + Seek> ReadSeek for std::io::BufReader<R> {}
+impl<R: AsRef<[u8]>> ReadSeek for std::io::Cursor<R> {}
Review comment:
Yes, they are not needed. I originally had just `Read + Seek` in my branch, because `BufReader` is `Seek` if `R` is `Seek`, and `Cursor` is `Seek` on `&[u8]` in std. They should be removed.
[GitHub] [arrow-datafusion] Igosuki edited a comment on pull request #1556: Officially maintained Arrow2 branch
Posted by GitBox <gi...@apache.org>.
Igosuki edited a comment on pull request #1556:
URL: https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1013038787
@houqp What hardware/setup did you use to run the benchmark? I'm actually getting way worse performance when running TPC-H using parquet.
Edit: OK, flamegraphs showed me that the parquet reader in the arrow2 branch is being passed a `File` instead of a `BufReader`.