You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/17 14:32:53 UTC

[GitHub] [arrow-datafusion] the-mousaillon opened a new issue, #4259: approx_percentile_cont panics because array is not ordered

the-mousaillon opened a new issue, #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259

   **Describe the bug**
   approx_quantile_cont panics, complaining that the input to TDigest is not ordered:
   ```panicked at 'unsorted input to TDigest'```
   
   I have done some digging to understand what happens and it seams to have something to do with a corruption of indexes whithin the arrow Array.
   
   I added a simple check to see if the array is ordered, and if not to print it in the [update_batch](https://github.com/apache/arrow-datafusion/blob/929a17525917727aa99b26667e9c3a8c55bd67c3/datafusion/physical-expr/src/aggregate/approx_percentile_cont.rs#L368-L374) function
   ```rust
       fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
           let values = &values[0];
           let v = downcast_value!(values, Int64Array);
           let sorted_values = &arrow::compute::sort(values, None)?;
           let a_down = downcast_value!(sorted_values, Int64Array);
   
           let mut is_sorted = true;
           for i in 0..sorted_values.len() -1 {
               if a_down.value(i) > a_down.value(i+1) {
                   is_sorted = false;
                   break;
               }
           }
           if !is_sorted {
              for i in 0..sorted_values.len() {
               println!("s: {}, b: {}", a_down.value(i), v.value(i));
           }
           }
           let sorted_values =
               ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
           self.digest = self.digest.merge_sorted_f64(&sorted_values);
           Ok(())
       }
   ```
   
   The output is surprising, we see that in seemingly every case, the first value of the sorted array *"s"* should be at the end of the array.
   For intance on this array :
   ![image](https://user-images.githubusercontent.com/36140579/202452429-3af4bf33-88c2-4580-a50f-4ee9b26dec28.png)
   
   The weird thing is that if I recreate the array, the sort works properly and the panic goes away.
   ```rust
       fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
           let values = &values[0];
           # If I just recreate the array before the sorting, the problem goes away
           let v = downcast_value!(values, Int64Array);
           let values: Int64Array = v.iter().map(|d| d.clone()).collect();
           let values = Arc::new(values) as ArrayRef;
           let values = &values;
           let sorted_values =
               ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
           self.digest = self.digest.merge_sorted_f64(&sorted_values);
           Ok(())
   }
   ```
   This makes me think that their may be some kind of index corruption within the array buffer.
   
   I had this bug while performing an approx_percentile_cont on a GROUP BY, whithout the GROUP BY it works fine.
   
   **To Reproduce**
   Steps to reproduce:
   1. Download the parquet file: [percentile_cont_bug.zip](https://github.com/apache/arrow-datafusion/files/10032416/percentile_cont_bug.zip)
   
   2. Execute this function
   
   ```rust
   fn percentile_cont_bug() ->Result<(), ()>{
       let ctx = SessionContext::new();
       let pq_path = "...path to the provided parquet file";
       ctx.register_parquet("example", pq_path, ParquetReadOptions::default())
           .await
           .map_err(|e|())?;
       
       let q = format!("
           SELECT 
               country,
               approx_median(age) as median_age,
           FROM example
           group by 
               country
       ");
       let df = ctx.sql(&q).await
       .map_err(|e| println!("err: {:?}", e))?;
       // execute and print results
       df.show().await;
       Ok(())
   }
   ```
   
   **Additional context**
   datafusion version: 14.0.0
   arrow version: 26.0
   platform: WSL (ubuntu)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323329448

   I've tested again and @tustvold I guess you are right.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323277652

   cc @tustvold, could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323297857

   Without having spent any time looking into the code, yet, my hunch would be that the accumulator is handling nulls incorrectly, and the `s: 72` "value" is actually a null.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1321405612

   Thank you @the-mousaillon for reporting the bug. We will debug it carefully. 
   Also, welcome to contribute your fix to Datafusion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1324731699

   Hi @domodwyer, could you please to fix this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] domodwyer commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
domodwyer commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323394363

   Yeah I don't remember doing anything with NULL masks when I wrote this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] the-mousaillon commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
the-mousaillon commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1318793149

   Also, I only have this issue if I use approx_percentile_cont on an INT64 column


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323295587

   The `age` column contains `null`s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323277310

   I can reproduce the bug printed here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] domodwyer commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
domodwyer commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1326197678

   Hi @HaoYang670,
   
   I do not have time to fix this in the short term as I am working on a project. I can take a look in a few weeks - feel free to fix this yourself if you need it sooner :+1:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323286193

   I can add it to my list, although I'm not familiar with the code in question. Perhaps @domodwyer or @crepererum may have some ideas


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1323333153

   The `ApproxPercentileAccumulator::convert_to_float` seems to ignore the null buffer when casting an array to `vec<f64>` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #4259: approx_percentile_cont panics because array is not ordered

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259#issuecomment-1325995968

   BTW, the estimate quantile algorithm doesn't follow the `paper`, any reason for this?
   https://github.com/apache/arrow-datafusion/blob/df8aa7a2e2a6f54acfbfed336b84144256fb7ff8/datafusion/physical-expr/src/aggregate/tdigest.rs#L523-L524
   
   ![image](https://user-images.githubusercontent.com/59198230/203706553-91202460-aabc-4623-93ac-c16fdd7622f6.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org