Posted to github@arrow.apache.org by "marsupialtail (via GitHub)" <gi...@apache.org> on 2023/05/09 19:08:42 UTC
[GitHub] [arrow] marsupialtail commented on issue #35508: adding data to tdigest in pyarrow
marsupialtail commented on issue #35508:
URL: https://github.com/apache/arrow/issues/35508#issuecomment-1540750019
Hey, if you don't mind using a random Python project, try this:
pip3 install ldbpy
Then:
```
#!/usr/bin/env python3
import time

import ldbpy
import polars
import pyarrow.compute as pac
from pyarrow.cffi import ffi

# Allocate C structs for the Arrow C data interface and take their addresses.
c_schema = ffi.new("struct ArrowSchema*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
c_array = ffi.new("struct ArrowArray*")
array_ptr = int(ffi.cast("uintptr_t", c_array))

# Load the column as one contiguous Arrow array and export it through the pointers.
lineitem = polars.read_parquet("demo-tpch/lineitem.parquet")
arr = lineitem.to_arrow()["l_tax"].combine_chunks()
arr._export_to_c(array_ptr, schema_ptr)

# Time ldbpy: pass the same exported array 20 times, then query quantiles.
a = ldbpy.NTDigest(20,100,10000)
start = time.time()
a.batch_add_arrow([array_ptr] * 20, [schema_ptr] * 20)
print(a.quantile(0, 0.5))
print(a.quantile(1, 0.1))
print(time.time() - start)

# Time pyarrow's built-in tdigest on the same column for comparison.
start = time.time()
print(pac.tdigest(arr, 0.5))
print(time.time() - start)
```