Posted to github@arrow.apache.org by "progval (via GitHub)" <gi...@apache.org> on 2024/03/05 10:25:59 UTC
[I] ORDER BY is ignored when COPYing from a pyarrow table to a csv file [arrow-datafusion-python]
progval opened a new issue, #609:
URL: https://github.com/apache/arrow-datafusion-python/issues/609
**Describe the bug**
ORDER BY is ignored when COPYing from a pyarrow table to a csv file
This happens both for tables created with `pyarrow.Table.from_pydict` and for tables read from ORC files.
**To Reproduce**
```py
from pathlib import Path
import datafusion
import pyarrow.csv
import pyarrow.dataset
config = datafusion.SessionConfig()
config.set("datafusion.execution.minimum_parallel_output_files", "16")
ctx = datafusion.SessionContext(config=config)
ctx.from_arrow_table(pyarrow.Table.from_pydict({'value': [2, 1, 3]}), "content")
output_path = Path("/tmp/output.csv")
query = f"""
COPY (SELECT value FROM content ORDER BY value DESC)
TO '{output_path}' (
FORMAT CSV,
)
"""
df = ctx.sql(query)
columns = df.schema().names
assert columns == ["value"], columns
df.count() # force the query to run
print(output_path.read_text())
```
```
$ python3 /tmp/order_arrow_table.py
value
2
1
3
```
**Expected behavior**
Should print
```
value
3
2
1
```
**Additional context**
I tried to reproduce it directly in Rust, but this code does produce a sorted output as expected:
```rust
use std::sync::Arc;
use datafusion::arrow::array::PrimitiveArray;
use datafusion::arrow::datatypes::{DataType, Field, Schema, Int64Type};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;
use datafusion::datasource::MemTable;
#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
    let column: PrimitiveArray<Int64Type> = vec![2, 1, 3].into();
    let partition = RecordBatch::try_new(schema.clone(), vec![Arc::new(column)]).unwrap();
    let table = MemTable::try_new(schema, vec![vec![partition]]).unwrap();
    ctx.register_table("content", Arc::new(table)).unwrap();
    let df = ctx.sql("COPY (SELECT value FROM content ORDER BY value) TO '/tmp/output.csv'").await.unwrap();
    df.collect().await.unwrap();
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] ORDER BY is ignored when COPYing from a pyarrow table to a csv file [arrow-datafusion-python]
Posted by "mesejo (via GitHub)" <gi...@apache.org>.
mesejo commented on issue #609:
URL: https://github.com/apache/arrow-datafusion-python/issues/609#issuecomment-2031554920
This works correctly if you call
```python
df.collect()
```
as opposed to
```python
df.count()
```
What is (probably) happening is that DataFusion optimizes the query and removes the ORDER BY, since the result of `count` does not depend on row order.