Posted to issues@arrow.apache.org by "randolf-scholz (via GitHub)" <gi...@apache.org> on 2023/08/07 23:19:36 UTC

[GitHub] [arrow] randolf-scholz opened a new issue, #37055: `compute.value_counts` extremely slow for chunked `dictionary[int32,string]`-types

randolf-scholz opened a new issue, #37055:
URL: https://github.com/apache/arrow/issues/37055

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a large dataset (>100M rows) with a `dictionary[int32,string]` column (`ChunkedArray`) and noticed that `compute.value_counts` is extremely slow for this column, compared to other columns.
   
   `table[col].value_counts()` is 10x-100x slower than `table[col].combine_chunks().value_counts()` in this case.
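
   A minimal, self-contained sketch of the comparison (synthetic values and chunk counts chosen arbitrarily, not the real data; absolute timings will differ):

   ```python
   import time
   import pyarrow as pa
   import pyarrow.compute as pc

   # Build a chunked dictionary[int32, string] column out of many small chunks.
   chunk = pa.array(["a", "b", "c", "a", "b"] * 200).dictionary_encode()
   col = pa.chunked_array([chunk] * 5000)

   t0 = time.perf_counter()
   pc.value_counts(col)                   # value_counts on the chunked column
   t1 = time.perf_counter()
   pc.value_counts(col.combine_chunks())  # same data as a single contiguous array
   t2 = time.perf_counter()
   print(f"chunked:  {t1 - t0:.3f} s")
   print(f"combined: {t2 - t1:.3f} s")
   ```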
   
   
   ### Component(s)
   
   C++, Python


[GitHub] [arrow] randolf-scholz commented on issue #37055: `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1668742257

   Yes, 12.0.1.


Re: [I] [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray` [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv closed issue #37055: [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray`
URL: https://github.com/apache/arrow/issues/37055


Re: [I] [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray` [arrow]

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1772555892

   @js8544 The dataset in question was the table "hosp/labevents.csv" from the MIMIC-IV dataset: https://physionet.org/content/mimiciv/2.2/.
   
   I changed my own preprocessing, so it doesn't really affect me anymore, but I was able to reproduce it in pyarrow 13:
   
   1. Read the csv file, parsing the `"value"`-column to `dictionary[int32, string]` (one possible way is sketched after this list)
   2. `%timeit table["value"].value_counts()`: 10.5 s ± 102 ms (on desktop, was worse on laptop with fewer cores)
   3. `%timeit table["value"].combine_chunks().value_counts()`: 1.29 s ± 12.9 ms
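
   One possible way to do step 1 (the exact read options used originally are not shown in the thread, so the explicit conversion below is an assumption):

   ```python
   import pyarrow as pa
   import pyarrow.csv as csv

   # Ask the CSV reader to produce the "value" column as dictionary[int32, string] directly.
   convert_options = csv.ConvertOptions(
       column_types={"value": pa.dictionary(pa.int32(), pa.string())}
   )
   table = csv.read_csv("hosp/labevents.csv", convert_options=convert_options)
   # table["value"] is then a ChunkedArray with one chunk per CSV read block.
   ```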
   
   The stats of the data are: 
   
   - `length`: 118,171,367
   - `null_count`: 19,803,023 (~17%)
   - `num_chunks`: 13095
   - `num_unique`: 39160
   - binary entropy (non-null): 9.48 bits 
   - [normalized entropy](https://mc-stan.org/posterior/reference/entropy.html): 62%
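
   For reference, a sketch of how figures like these could be computed (using `table` from the read sketch above; the entropy formula is my assumption, following the linked definition):

   ```python
   import numpy as np
   import pyarrow.compute as pc

   col = table["value"]  # ChunkedArray of dictionary[int32, string]
   print(len(col), col.null_count, col.num_chunks, pc.count_distinct(col).as_py())

   # Shannon entropy (in bits) of the non-null value distribution, and its
   # normalization by the maximum possible entropy log2(num_unique).
   counts = pc.value_counts(col.drop_null()).field("counts").to_numpy(zero_copy_only=False)
   p = counts / counts.sum()
   entropy_bits = float(-(p * np.log2(p)).sum())
   normalized = entropy_bits / float(np.log2(len(p)))
   print(entropy_bits, normalized)
   ```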
   
   


[GitHub] [arrow] pitrou commented on issue #37055: [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1688570712

   cc @js8544 @felipecrv 


[GitHub] [arrow] js8544 commented on issue #37055: [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1688582657

   Ah, I had done some research on this issue but forgot to post my findings. I think @rok's comment [here](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_hash.cc#L452) and the discussion [here](https://github.com/apache/arrow/pull/9683#issuecomment-800442398) explain it well. We can optimize it by first computing the counts over each chunk and then hash-aggregating the results. However, I don't think we can directly call hash-aggregate functions from within compute kernels without depending on Acero?
   
   cc @westonpace Can you confirm?


[GitHub] [arrow] westonpace commented on issue #37055: [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1690179765

   I'm not entirely sure I understand the goal.  The aggregate operations do have standalone python bindings.  For example:
   
   ```
   >>> import pyarrow as pa
   >>> x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
   >>> import pyarrow.compute as pc
   >>> pc.sum(x)
   <pyarrow.Int64Scalar: 45>
   ```
   
   However, the individual parts (the partial aggregate func (Consume) and the final aggregate func (Finalize)) cannot be called from python individually. So, for example, it is not possible to create a streaming aggregator in python.
   
   However, in this case, you might be able to get away with something like this:
   
   ```
   import pyarrow as pa
   import pyarrow.compute as pc
   
   x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
   y = pa.chunked_array([[1, 1, 2, 2, 3], [4, 4]])
   
   x_counts = pc.value_counts(x)
   y_counts = pc.value_counts(y)
   
   x_batch = pa.RecordBatch.from_struct_array(x_counts)
   y_batch = pa.RecordBatch.from_struct_array(y_counts)
   
   table = pa.Table.from_batches([x_batch, y_batch])
   
   counts = table.group_by("values").aggregate([("counts", "sum")])
   ```
   
   I'm not sure if it will be faster or not.


Re: [I] [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray` [arrow]

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1774404687

   Thanks! Since the original file requires registration and some other verification processes, I downloaded a [demo file](https://physionet.org/content/mimic-iv-demo/2.2/hosp/labevents.csv.gz) with about 100K rows. Nevertheless, I was able to optimize `value_counts()` to the same level as `combine_chunks().value_counts()`:
   ```python
   # Before
   1.04 ms ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   625 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   # After
   642 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   610 µs ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   ```
   I'll write a formal C++ benchmark to further verify and send a PR shortly.


[GitHub] [arrow] assignUser commented on issue #37055: `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1668739901

   I assume this is with pyarrow 12?


[GitHub] [arrow] js8544 commented on issue #37055: [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray`

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1690920363

   > I'm not entirely sure I understand the goal.
   
   Sorry I wasn't clear enough. As discussed [here](https://github.com/apache/arrow/pull/9683#issuecomment-800442398), there are two ways to implement the `value_counts` kernel for Dictionary inputs. The current implementation uses the first approach, but we want to switch to the second for better performance. However, we would need to call `hash_count` within the `value_counts` kernel. There used to be an `internal::GroupBy` available, but I am not sure whether that's still possible after the refactoring. To be clear, I'm talking about the kernel implementation in C++, not user code in Python.


Re: [I] [C++][Python] `value_counts` extremely slow for chunked `DictionaryArray` [arrow]

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37055:
URL: https://github.com/apache/arrow/issues/37055#issuecomment-1771890185

   Hi @randolf-scholz, do you remember how many chunks are in your `ChunkedArray`? I'm optimizing this kernel and would like to reproduce your case.

