Posted to github@arrow.apache.org by "progval (via GitHub)" <gi...@apache.org> on 2024/03/05 10:25:59 UTC

[I] ORDER BY is ignored when COPYing from a pyarrow table to a csv file [arrow-datafusion-python]

progval opened a new issue, #609:
URL: https://github.com/apache/arrow-datafusion-python/issues/609

   **Describe the bug**
   ORDER BY is ignored when COPYing from a pyarrow table to a csv file
   
   This happens both for tables created with `pyarrow.Table.from_pydict` and for tables loaded from ORC files.
   
   **To Reproduce**
   
   ```py
   from pathlib import Path
   
   import datafusion
   import pyarrow.csv
   import pyarrow.dataset
   
   config = datafusion.SessionConfig()
   config.set("datafusion.execution.minimum_parallel_output_files", "16")
   ctx = datafusion.SessionContext(config=config)
   ctx.from_arrow_table(pyarrow.Table.from_pydict({'value': [2, 1, 3]}), "content")
   
   output_path = Path("/tmp/output.csv")
   
   query = f"""
       COPY (SELECT value FROM content ORDER BY value DESC)
       TO '{output_path}' (
           FORMAT CSV,
       )
   """
   df = ctx.sql(query)
   
   columns = df.schema().names
   assert columns == ["value"], columns
   
   df.count()  # force the query to run
   
   print(output_path.read_text())
   ```
   
   ```
   $ python3 /tmp/order_arrow_table.py
   value
   2
   1
   3
   ```
   
   **Expected behavior**
   Should print
   
   ```
   value
   3
   2
   1
   ```
   
   **Additional context**
   
   I tried to reproduce it directly in Rust, but this code does produce a sorted output as expected:
   
   ```rust
   use std::sync::Arc;
   use datafusion::arrow::array::PrimitiveArray;
   use datafusion::arrow::datatypes::{DataType, Field, Schema, Int64Type};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::prelude::*;
   use datafusion::datasource::MemTable;
   
   #[tokio::main]
   async fn main() {
       let ctx = SessionContext::new();
   
       let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
       let column: PrimitiveArray<Int64Type> = vec![2, 1, 3].into();
       let partition = RecordBatch::try_new(schema.clone(), vec![Arc::new(column)]).unwrap();
       let table = MemTable::try_new(schema, vec![vec![partition]]).unwrap();
       ctx.register_table("content", Arc::new(table)).unwrap();
   
       let df = ctx.sql("COPY (SELECT value FROM content ORDER BY value) TO '/tmp/output.csv'").await.unwrap();
       df.collect().await.unwrap();
   }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] ORDER BY is ignored when COPYing from a pyarrow table to a csv file [arrow-datafusion-python]

Posted by "mesejo (via GitHub)" <gi...@apache.org>.

mesejo commented on issue #609:
URL: https://github.com/apache/arrow-datafusion-python/issues/609#issuecomment-2031554920

   This works correctly if you call
   ```python
   df.collect()
   ```
   as opposed to 
   ```python
   df.count()
   ```
   DataFusion is probably optimizing the query and removing the ORDER BY, since the result of `count` does not depend on row order.
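
To confirm which call actually produces ordered output, the written file can be checked directly. The following is a minimal sketch using only the standard library; the helper name `is_sorted_desc` is hypothetical and not part of DataFusion or of the original report, and it assumes the output is a single CSV file with an integer `value` column, as in the reproduction above.

```python
import csv
import io


def is_sorted_desc(csv_text: str, column: str) -> bool:
    """Return True if `column` in the CSV text is in non-increasing order."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [int(row[column]) for row in rows]
    # Compare each value with its successor; an empty or single-row file counts as sorted.
    return all(a >= b for a, b in zip(values, values[1:]))


# Output reported in the issue (ORDER BY ignored) vs. the output the query asks for.
reported = "value\n2\n1\n3\n"
expected = "value\n3\n2\n1\n"

print(is_sorted_desc(reported, "value"))  # False
print(is_sorted_desc(expected, "value"))  # True
```

In the reproduction script, the same check could be applied as `is_sorted_desc(output_path.read_text(), "value")` after forcing execution with either `df.count()` or `df.collect()`.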

