Posted to github@arrow.apache.org by "baumgold (via GitHub)" <gi...@apache.org> on 2023/03/13 23:28:31 UTC

[GitHub] [arrow-julia] baumgold commented on a diff in pull request #400: add kwarg chunksize for default data partitioning for write

baumgold commented on code in PR #400:
URL: https://github.com/apache/arrow-julia/pull/400#discussion_r1134709709


##########
src/write.jl:
##########
@@ -48,14 +48,29 @@ Supported keyword arguments to `Arrow.write` include:
   * `metadata=Arrow.getmetadata(tbl)`: the metadata that should be written as the table's schema's `custom_metadata` field; must either be `nothing` or an iterable of `<:AbstractString` pairs.
   * `ntasks::Int`: number of buffered threaded tasks to allow while writing input partitions out as arrow record batches; default is no limit; for unbuffered writing, pass `ntasks=0`
   * `file::Bool=false`: if an `io` argument is being written to, passing `file=true` will cause the arrow file format to be written instead of just IPC streaming
+  * `chunksize::Union{Nothing,Integer}=64000`: if a table is being written, this will cause the table to be partitioned into chunks of the given size (`chunksize` rows); if `nothing`, no partitioning will occur
 """
 function write end
 
 write(io_or_file; kw...) = x -> write(io_or_file, x; kw...)
 
-function write(file_path, tbl; kwargs...)
+function write(file_path, tbl; chunksize::Union{Nothing,Integer}=64000, kwargs...)

Review Comment:
   I think `chunksize` should become a new field in `Writer`, with the default kwarg value set in the `Base.open` constructor on L170. This would eliminate the code duplication.
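
   The suggestion above could look roughly like the sketch below. It is a minimal illustration, not Arrow.jl's actual code: `SketchWriter`, `open_sketch_writer`, and `write_sketch` are hypothetical names standing in for `Writer`, its `Base.open` constructor, and the two `write` methods. The point is only that the default lives in one place and both entry points forward their keyword arguments to it.

```julia
# Hypothetical stand-in for Arrow's `Writer`; the real struct has other fields.
struct SketchWriter{T<:IO}
    io::T
    chunksize::Union{Nothing,Int}
end

# The only place the default value is spelled out
# (plays the role of the `Base.open` constructor in the suggestion).
function open_sketch_writer(io::IO; chunksize::Union{Nothing,Integer}=64000)
    if chunksize !== nothing && chunksize < 1
        throw(ArgumentError("chunksize must be >= 1 or nothing"))
    end
    return SketchWriter(io, chunksize === nothing ? nothing : Int(chunksize))
end

# IO entry point: forwards kwargs, so it does not repeat the default.
function write_sketch(io::IO, tbl; kwargs...)
    writer = open_sketch_writer(io; kwargs...)
    # Real code would write `tbl` here; the sketch just reports the setting.
    return writer.chunksize
end

# Path entry point: opens the file and delegates to the IO method.
write_sketch(path::AbstractString, tbl; kwargs...) =
    open(io -> write_sketch(io, tbl; kwargs...), path, "w")
```

   With this shape, changing the default later touches a single line, and neither `write_sketch` method needs a `chunksize` keyword of its own.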



##########
src/write.jl:
##########
@@ -278,9 +293,23 @@ function Base.close(writer::Writer)
     nothing
 end
 
-function write(io::IO, tbl; kwargs...)
+function write(io::IO, tbl; chunksize::Union{Nothing,Integer}=64000, kwargs...)
+    # rowaccess is a necessary (but not sufficient) prerequisite for row iteration
+    if !isnothing(chunksize) && Tables.istable(tbl) && Tables.rowaccess(tbl)
+        @assert chunksize >= 0 "chunksize must be >= 0"
+        if hasmethod(Iterators.partition,(typeof(tbl),))
+            tbl_source = Iterators.partition(tbl, chunksize)

Review Comment:
   Can we use `Iterators.partition` from Base rather than `DataFrames` to prevent adding one more dependency?
   
   https://docs.julialang.org/en/v1/base/iterators/#Base.Iterators.partition
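
   For reference, here is a small sketch of how `Iterators.partition` from Base behaves on a row-iterable source. The `rows` vector and the `chunksize` value are made up for illustration; a vector of NamedTuples stands in for a table with row access, and no table package is involved.

```julia
# Toy row source: a vector of NamedTuples standing in for a row-iterable table.
rows = [(id = i, value = 2i) for i in 1:10]

chunksize = 4  # illustrative value, not the PR's default

# Base's Iterators.partition lazily yields groups of at most `chunksize` rows;
# for a Vector, each group is a view into the original data.
chunks = collect(Iterators.partition(rows, chunksize))

for (i, chunk) in enumerate(chunks)
    println("chunk ", i, ": ", length(chunk), " rows, first id = ", first(chunk).id)
end
```

   Ten rows with a chunk size of 4 give three chunks of 4, 4, and 2 rows. Note that `Iterators.partition` throws an `ArgumentError` for a size below 1, so a `chunksize >= 0` check would still let an invalid `0` through.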


