You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "dadepo (via GitHub)" <gi...@apache.org> on 2023/06/02 03:16:41 UTC

[GitHub] [arrow-datafusion] dadepo opened a new issue, #6524: array_agg panics on large rows.

dadepo opened a new issue, #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524

   ### Describe the bug
   
   ```
       let df = ctx.sql(r#"
       SELECT count(payload) from demo
       "#).await?;
       
   +---------------------+
   | COUNT(demo.payload) |
   +---------------------+
   | 68934                          |
   +---------------------+
   ```
   
   Then
   
   ```
       let df = ctx.sql(r#"
       SELECT array_agg(payload) from demo
       "#).await?;
   ```
   
   After running for about 2 mins, panics
   
   ```
   thread 'main' panicked at 'attempt to add with overflow', /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
   stack backtrace:
      0: rust_begin_unwind
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/panicking.rs:575:5
      1: core::panicking::panic_fmt
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/panicking.rs:64:14
      2: core::panicking::panic
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/panicking.rs:114:5
      3: comfy_table::utils::ColumnDisplayInfo::width
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
      4: comfy_table::utils::formatting::borders::draw_top_border
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/formatting/borders.rs:54:40
      5: comfy_table::utils::formatting::borders::draw_borders
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/formatting/borders.rs:21:20
      6: comfy_table::utils::build_table
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:51:5
      7: comfy_table::table::Table::lines
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/table.rs:95:9
      8: <comfy_table::table::Table as core::fmt::Display>::fmt
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/table.rs:47:25
      9: core::fmt::write
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/fmt/mod.rs:1230:17
     10: std::io::Write::write_fmt
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/mod.rs:1682:15
     11: <&std::io::stdio::Stdout as std::io::Write>::write_fmt
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:715:9
     12: <std::io::stdio::Stdout as std::io::Write>::write_fmt
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:689:9
     13: std::io::stdio::print_to
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1007:21
     14: std::io::stdio::_print
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1074:5
     15: arrow_cast::pretty::print_batches
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-38.0.0/src/pretty.rs:62:5
     16: datafusion::dataframe::DataFrame::show::{{closure}}
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:682:12
     17: query_playground::main::{{closure}}
                at ./src/query_playground.rs:785:14
     18: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/park.rs:283:63
     19: tokio::runtime::coop::with_budget
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/coop.rs:102:5
     20: tokio::runtime::coop::budget
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/coop.rs:68:5
     21: tokio::runtime::park::CachedParkThread::block_on
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/park.rs:283:31
     22: tokio::runtime::context::BlockingRegionGuard::block_on
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/context.rs:315:13
     23: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/scheduler/multi_thread/mod.rs:66:9
     24: tokio::runtime::runtime::Runtime::block_on
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/runtime.rs:284:45
     25: query_playground::main
                at ./src/query_playground.rs:812:5
     26: core::ops::function::FnOnce::call_once
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
   ```
   
   ### To Reproduce
   
   Create a table with a large number of rows and use array_agg on one of its columns
   
   ### Expected behavior
   
   array_agg should not panic on large rows.
   
   ### Additional context
   
   ```
   datafusion = { version = "24.0.0" }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Nukesor commented on issue #6524: array_agg panics on large rows.

Posted by "Nukesor (via GitHub)" <gi...@apache.org>.
Nukesor commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1590490691

   Hey :)
   
   In comfy-table, I went with the approach to choose a `u16` for table widths, since this covered all use-cases I could come up with. This allows the user to build tables that're `65535` characters wide.
   
   If there's some content that's longer than 65535 chars, the table width is internally fixed to that `u16::MAX`, but the formatting can look a bit wonky, since the actual content can be even longer than `65535` chars.
   However, I consider this a non-issue, since no user will ever have a screen wide enough to display that many chars and the output on your terminal will be botched anyway.
   In case a user wants proper formatting, there's the `Dynamic` content adjustment mode, which introduces line-breaks to fit the content to the current/specified terminal width.
   
   I already tried to use the appropriate `saturating_[sub|add]` functions everywhere, but it seems I overlooked a few places. A PR https://github.com/Nukesor/comfy-table/pull/114 has been opened, but it would be awesome, if you somebdoy could test this or even write a regression test case to trigger the panic on the current code-base.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on issue #6524: array_agg panics on large rows.

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1583212705

   > This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔
   
   That gives the idea to show only last N rows and make a param on that. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on issue #6524: array_agg panics on large rows.

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1581701662

   @dadepo are you trying to show the result? reg to the stacktrace it failed on console output , not on the computation itself
   
   ```
   14: std::io::stdio::_print
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1074:5
     15: arrow_cast::pretty::print_batches
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-38.0.0/src/pretty.rs:62:5
     16: datafusion::dataframe::DataFrame::show::{{closure}}
                at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:682:12
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thomas-k-cameron commented on issue #6524: array_agg panics on large rows.

Posted by "thomas-k-cameron (via GitHub)" <gi...@apache.org>.
thomas-k-cameron commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1587364220

   > > This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔
   > 
   > That gives the idea to show only first N rows and make a param on that.
   
   This doesn't work because the overflow happens on calculating `width`. I think you can restrict the number letters you pass to each row though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thomas-k-cameron commented on issue #6524: array_agg panics on large rows.

Posted by "thomas-k-cameron (via GitHub)" <gi...@apache.org>.
thomas-k-cameron commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1585713825

   Few things that concerns me.
   
   1. Performance: You don't really know how wide a column has to be until you go to the very end; which can turn out of be time-consuming.  
   2. Comfy-table: Say we implement an algorithm to calculate the comfy-table's width, but you don't really know when they change the implementation.
   
   I'm not one of the maintainer so I appreciate advice.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thomas-k-cameron commented on issue #6524: array_agg panics on large rows.

Posted by "thomas-k-cameron (via GitHub)" <gi...@apache.org>.
thomas-k-cameron commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1587301169

   I created a ticket on comfy-table. They are pretty active so I should be able to hear from them.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6524: array_agg panics on large rows.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1583170412

   This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6524: array_agg panics on large rows.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1587183199

   > That gives the idea to show only first N rows and make a param on that.
   
   This is a neat idea @comphead  -- what I recall `psql` doing is buffering the first 1000 rows or something and using that sample to determine column widths for the rest of the output, which is streamed.
   
   I have always wanted to do something similar for datafusion-cli (and IOx) but I haven't gotten around to doing so


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Nukesor commented on issue #6524: array_agg panics on large rows.

Posted by "Nukesor (via GitHub)" <gi...@apache.org>.
Nukesor commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1594541652

   New release with a bunch of fixes has been published :)
   
   https://github.com/Nukesor/comfy-table/releases/tag/v7.0.1
   https://crates.io/crates/comfy-table/7.0.1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] dadepo commented on issue #6524: array_agg panics on large rows.

Posted by "dadepo (via GitHub)" <gi...@apache.org>.
dadepo commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1585698019

   Is having `arrow-datafusion` truncate the width, to a max value, that does not overflow an acceptable fix? Is that easily doable? If so, I'll say that is better than blowing up as it currently does.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thomas-k-cameron commented on issue #6524: array_agg panics on large rows.

Posted by "thomas-k-cameron (via GitHub)" <gi...@apache.org>.
thomas-k-cameron commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1585676871

   I think that's because confy-table uses u16.
   
   Your error says,
   ```
   thread 'main' panicked at 'attempt to add with overflow', /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
   stack backtrace:
      0: rust_begin_unwind
                at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/panicking.rs:575:5
   ```
   
   and on the confy-table, you can find
   https://github.com/Nukesor/comfy-table/blob/main/src/utils/mod.rs
   ```rust
       pub fn width(&self) -> u16 {
           self.content_width + self.padding.0 + self.padding.1
       }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6524: array_agg panics on large rows.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1587180907

   I think this is likely something that needs to be fixed in arrow-rs or comfy-table. 
   
   I agree with the tradeoffs identified by @thomas-k-cameron but don't know enough about comfy-table and its maintenance status to know how easy it is to fix or how likely they would be to accept such a change.  
   
   Thus, my suggestion is:
   1. File a ticket in comfy table explaining the issue and see what, if any, response comes back 
   2. File a ticket in arrow-rs about the panic on large width columns in display
   
   Depending on the response you get from those two crates may help decide which approach to take. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] dadepo commented on issue #6524: array_agg panics on large rows.

Posted by "dadepo (via GitHub)" <gi...@apache.org>.
dadepo commented on issue #6524:
URL: https://github.com/apache/arrow-datafusion/issues/6524#issuecomment-1583235345

   > @dadepo are you trying to show the result? reg to the stacktrace it failed on console output , not on the computation itself
   
   @comphead I just checked again, and yeah it blows up with the `df.show().await?` call. 
   
   `df.collect().await?` on the other hand, does not give the error
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org