
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/09 12:47:55 UTC

[GitHub] [arrow-datafusion] selvavm opened a new issue #1533: When using Dataframe getting empty row but pretty print contain rows

selvavm opened a new issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533


   **Describe the bug**
   Indexing into the collected results of a query over a parquet file gives an empty array, even though pretty-printing the same results shows data.
   
   **To Reproduce**
   
       let df = df
           .aggregate(
               vec![col("name")],
               vec![
                   min(col("salary")).alias("min"),
                   max(col("salary")).alias("max"),
               ],
           )?;
       let results: Vec<RecordBatch> = df.collect().await?;
       pretty::print_batches(&results)?;
       println!(
           "Min for Aaa is {:?}",
           results[1]
               .column(1)
               .as_any()
               .downcast_ref::<Float32Array>()
               .unwrap()
               .value(0)
       );
   
   **Expected behavior**
   Prints,
   
   | name           | min        | max       |
   | :---         |     :---:      |          ---: |
   | Aaa | 5755.896  | 6388.9575 |
   | Bbb | 6905.5454 | 7203.9756 |
   
   Min for Aaa is 5755.896
   
   **Actual behavior**
   Prints,
   | name           | min        | max       |
   | :---         |     :---:      |          ---: |
   | Aaa | 5755.896  | 6388.9575 |
   | Bbb | 6905.5454 | 7203.9756 |
   
   `thread '<unnamed>' panicked at 'assertion failed: i < self.len()', C:\Users\seved\.cargo\registry\src\github.com-1ecc6299db9ec823\arrow-6.5.0\src\array\array_primitive.rs:120:9`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] selvavm edited a comment on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

selvavm edited a comment on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008620650


   Hi @alamb. Thanks for the response. I will see if I can put together a self-contained reproducer. Sorry, I am new to Parquet files and DataFusion, so I am having trouble understanding it.
   
   I also found that `results` is of size 12 with most of them having 0 rows. So, I added my code like below,
   
       for batch in results {
           for i in 0..batch.num_rows() {
               let min = batch
                   .column(1)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               min_col.push(min);
               let max = batch
                   .column(0)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               max_col.push(max);
           }
       }
   
   Not an elegant approach. Is there any util to combine all `Vec<RecordBatch>` into one `RecordBatch`?
   



[GitHub] [arrow-datafusion] selvavm edited a comment on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

selvavm edited a comment on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008620650


   Hi @alamb. Thanks for the response. I will see if I can put together a self-contained reproducer. Sorry, I am new to Parquet files and DataFusion, so I am having trouble understanding it.
   
   I also found that `results` is of size 12 with most of them having 0 rows. So, I added my code like below,
   
       for batch in results {
           for i in 0..batch.num_rows() {
               let max = batch
                   .column(0)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               max_col.push(max);
               let min = batch
                   .column(1)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               min_col.push(min);
           }
       }
   
   Not an elegant approach. Is there any util to combine all `Vec<RecordBatch>` into one `RecordBatch`?
   



[GitHub] [arrow-datafusion] alamb commented on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008775801


   > Not an elegant approach. Is there any util to combine all Vec<RecordBatch> into one RecordBatch?
   
   @selvavm you can use `RecordBatch::concat` for that purpose:
   
   https://docs.rs/arrow/6.5.0/arrow/record_batch/struct.RecordBatch.html#method.concat
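
The idea behind the suggestion above can be sketched without arrow at all. The following is a minimal model, not the arrow API: `BatchColumn` and `concat_values` are hypothetical names, each batch is reduced to a plain `Vec<f32>` holding one column, and the sample layout is made up. It only illustrates why joining the batches first removes the need to know which batch holds which row, which is what `RecordBatch::concat` is recommended for in this thread.

```rust
/// Hypothetical stand-in for one column of one record batch: just its values.
/// This is NOT an arrow type; it only models the shape of the data.
type BatchColumn = Vec<f32>;

/// Flatten the per-batch values into one vector, in batch order.
/// Empty batches contribute nothing, so no out-of-bounds index is possible.
fn concat_values(batches: &[BatchColumn]) -> Vec<f32> {
    batches.iter().flat_map(|b| b.iter().copied()).collect()
}

fn main() {
    // Made-up layout mimicking the report: several batches, most with 0 rows.
    let results: Vec<BatchColumn> =
        vec![vec![], vec![], vec![5755.896], vec![], vec![6905.5454]];

    // `results[1][0]` would panic here because batch 1 is empty.
    let min_col = concat_values(&results);
    println!("{} values: {:?}", min_col.len(), min_col);
}
```

With real `RecordBatch` values the same shape applies: combine the batches once, then downcast and index the single resulting column.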
   



[GitHub] [arrow-datafusion] selvavm commented on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

selvavm commented on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008620650


   Hi @alamb. Thanks for the response. I will see if I can put together a self-contained reproducer. Sorry, I am new to Parquet files and DataFusion, so I am having trouble understanding it.
   
   I also found that `results` is of size 12 with most of them having 0 rows. So, I added my code like below,
   
       for batch in results {
           for i in 0..batch.num_rows() {
               let max = batch
                   .column(0)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               max_col.push(max);
               let min = batch
                   .column(1)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               min_col.push(min);
               let avg = batch
                   .column(2)
                   .as_any()
                   .downcast_ref::<Float64Array>()
                   .unwrap()
                   .value(i) as f32;
               avg_col.push(avg);
           }
       }
   
   Not an elegant approach. Is there any util to combine all `Vec<RecordBatch>` into one `RecordBatch`?
   



[GitHub] [arrow-datafusion] selvavm edited a comment on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

selvavm edited a comment on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008620650


   Hi @alamb. Thanks for the response. I will see if I can put together a self-contained reproducer. Sorry, I am new to Parquet files and DataFusion, so I am having trouble understanding it.
   
   I also found that `results` is of size 12 with most of them having 0 rows. So, I added my code like below,
   
       for batch in results {
           for i in 0..batch.num_rows() {
               let min = batch
                   .column(0)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               min_col.push(min);
               let max = batch
                   .column(1)
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .unwrap()
                   .value(i);
               max_col.push(max);
           }
       }
   
   Not an elegant approach. Is there any util to combine all `Vec<RecordBatch>` into one `RecordBatch`?
   



[GitHub] [arrow-datafusion] alamb commented on issue #1533: When using Dataframe getting empty row but pretty print contain rows

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1533:
URL: https://github.com/apache/arrow-datafusion/issues/1533#issuecomment-1008324922


   Hi @selvavm -- this is very strange. What happens when you print out the entire `results`?
   
   ```rust
   println!("Min for Aaa is {:#?}", results);
   ```
   
   If you can provide a self-contained reproducer, I can also help narrow this down.

