Posted to issues@arrow.apache.org by "shner-elmo (via GitHub)" <gi...@apache.org> on 2023/04/21 13:20:11 UTC

[GitHub] [arrow] shner-elmo opened a new issue, #35269: Partition a dataset by numeric column

shner-elmo opened a new issue, #35269:
URL: https://github.com/apache/arrow/issues/35269

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello, I was wondering if there is a more efficient way to partition a very large dataset (6B records) by a column that has many unique values.
   I have a column called ID with 400k unique values (integers ranging from `0` to `9698193`), and I'm wondering whether I can partition either by the first N digits of a given ID, or with something like a B-tree, where all records with an ID between X and Y are grouped into one directory.
   
   And in general, is there a way to create custom partitioning? (For example, by passing a function that takes a value and returns the partition key, or by subclassing a class.)
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] shner-elmo commented on issue #35269: Partition a dataset by numeric column

Posted by "shner-elmo (via GitHub)" <gi...@apache.org>.
shner-elmo commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1517835726

   Same for string columns: being able to partition by the first N characters of a string column would be great for columns that have many unique values.




[GitHub] [arrow] westonpace commented on issue #35269: [Python] Partition a dataset by numeric column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1523819701

   Something like this should work.  Note that it might crash on pyarrow 11.0.0 (the currently released version) because of a bug introduced in `write_dataset`.  The fix should ship in 12.0.0 (releasing soon), and the example should also work on 10.0.0.
   
   ```
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   import pyarrow.compute as pc
   
   # Create a table with one column of random 20-character strings and one column of incrementing integers
   A, Z = np.array(["A","Z"]).view("int32")
   LENGTH = 10_000_000
   STRLEN = 20
   np_arr = np.random.randint(low=A,high=Z,size=LENGTH*STRLEN,dtype="int32").view(f"U{STRLEN}")
   pa_arr = pa.array(np_arr)
   other_col = pa.array(range(LENGTH))
   table = pa.Table.from_arrays([pa_arr, other_col], names=["strings", "numbers"])
   
   # Write the table out.  This will be our "source dataset".  You already have this
   pq.write_table(table, "/tmp/source.parquet")
   
   # Create a dataset object to represent our source dataset
   my_dataset = ds.dataset(["/tmp/source.parquet"], format="parquet")
   
   # Create a column map.  We want to load all the columns as normal but we also
   # want to add an additional dynamic column which is the first 2 characters of the long
   # strings array
   columns = {}
   for field in my_dataset.schema:
       columns[field.name] = pc.field(field.name)
   columns["string_code"] = pc.utf8_slice_codeunits(pc.field("strings"), 0, 2)
   
   # Use a scanner as input to write_dataset.  This way we don't need to load the entire
   # dataset into memory.  Partition on our dynamic column.
   ds.write_dataset(
       my_dataset.scanner(columns=columns),
       "/tmp/my_dataset",
       partitioning=["string_code"],
       partitioning_flavor="hive",
       format="parquet",
   )
   ```
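
   The same scanner-projection trick should also cover the integer ID case from the original question: derive a bucket column with integer division instead of a string slice, so IDs between X and Y land in one directory.  A minimal sketch (the table contents, the 100,000 bucket size, and the `id_bucket` name are illustrative assumptions, not part of the thread):

   ```
   import tempfile

   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.dataset as ds

   # Hypothetical stand-in for the real 6B-row dataset; "id" mirrors the ID column
   table = pa.table({
       "id": [5, 12_345, 9_698_193, 150_000],
       "value": [1.0, 2.0, 3.0, 4.0],
   })
   source = ds.dataset(table)

   # Keep every original column and add a derived bucket column.
   # Integer division groups IDs into ranges of 100,000 each,
   # e.g. IDs 0-99999 -> bucket 0, 100000-199999 -> bucket 1, ...
   columns = {name: pc.field(name) for name in table.schema.names}
   columns["id_bucket"] = pc.divide(pc.field("id"), 100_000)

   out_dir = tempfile.mkdtemp()
   ds.write_dataset(
       source.scanner(columns=columns),
       out_dir,
       partitioning=["id_bucket"],
       partitioning_flavor="hive",
       format="parquet",
   )
   # For this sample table: directories id_bucket=0, id_bucket=1, id_bucket=96
   ```

   As with the string example above, the scanner streams batches through the projection, so the full dataset never needs to fit in memory.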

