You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "shner-elmo (via GitHub)" <gi...@apache.org> on 2023/04/21 13:20:11 UTC
[GitHub] [arrow] shner-elmo opened a new issue, #35269: Partition a dataset by numeric column
shner-elmo opened a new issue, #35269:
URL: https://github.com/apache/arrow/issues/35269
### Describe the usage question you have. Please include as many useful details as possible.
Hello, I was wondering if there is a more eficcient way to partition a very large dataset (6B records) by a column that has many unique values.
I have a column called ID that has 400k unique values (integer ranging from `0` to `9698193`) and I'm wondering if there is a way I can partition it with either the first N digits of a given ID, or create something like a B-tree where you group by all the records with the ID between X and Y in a directory.
And in general is there a way to create custom partitioning? (like maybe by passing a function that takes as input the value and returns the value to use for partitioning, or subclassing a class)
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] shner-elmo commented on issue #35269: Partition a dataset by numeric column
Posted by "shner-elmo (via GitHub)" <gi...@apache.org>.
shner-elmo commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1517835726
Same thing for string columns, to be able to partition by the first N chars of a string column would be great for columns that have many unique values.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #35269: [Python] Partition a dataset by numeric column
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1523819701
Something like this should work. Note that this might crash on pyarrow 11.0.0 (currently released version). There was a write_dataset bug introduced. It should be fixed in 12.0.0 (will release soon) but should also work on 10.0.0.
```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc
# Create a table with one column of random 20-character strings and one column of incrementing integers
A, Z = np.array(["A","Z"]).view("int32")
LENGTH = 10_000_000
STRLEN = 20
np_arr = np.random.randint(low=A,high=Z,size=LENGTH*STRLEN,dtype="int32").view(f"U{STRLEN}")
pa_arr = pa.array(np_arr)
other_col = pa.array(range(LENGTH))
table = pa.Table.from_arrays([pa_arr, other_col], names=["strings", "numbers"])
# Write the table out. This will be our "source dataset". You already have this
pq.write_table(table, "/tmp/source.parquet")
# Create a dataset object to represent our source dataset
my_dataset = ds.dataset(["/tmp/source.parquet"], format="parquet")
# Create a column map. We want to load all the columns as normal but we also
# want to add an additional dynamic column which is the first 2 characters of the long
# strings array
columns = {}
for field in my_dataset.schema:
columns[field.name] = pc.field(field.name)
columns["string_code"] = pc.utf8_slice_codeunits(pc.field("strings"), 0, 2)
# Use a scanner as input to write_dataset. This way we don't need to load the entire
# dataset into memory. Partition on our dynamic column.
ds.write_dataset(my_dataset.scanner(columns=columns), "/tmp/my_dataset", partitioning=["string_code"], partitioning_flavor="hive", format="parquet")
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org