Posted to github@arrow.apache.org by "svilupp (via GitHub)" <gi...@apache.org> on 2023/03/12 19:04:04 UTC

[GitHub] [arrow-julia] svilupp commented on issue #403: Missing type gets lost when writing partitions of DataFrame

svilupp commented on issue #403:
URL: https://github.com/apache/arrow-julia/issues/403#issuecomment-1465275189

   I think I know where it's coming from.
   
   The issue happens [here](https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/write.jl#L196)
   - Only the first partition is scanned to determine the schema
   - Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through Tables.columns
   - It does however keep the reference to the parent (and its schema)
   
   In other words, we do `partition |> Tables.columns |> Tables.schema`, which drops the `Missing` part of a column's element type whenever the first partition happens to contain no missing values.
   
   I don't know enough about the Tables API/contract to know whether this is an Arrow problem, Tables problem, or DataFrames problem. Does this issue belong somewhere else?
   
   It would be an easy fix to get schema info from the parent object, but are all Tables-compatible sources required to keep that?
   
   E.g.:
   - change from `partition |> Tables.columns |> Tables.schema`
   - to `partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema`
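
   A rough sketch of what that fallback could look like, in case it helps the discussion. `partition_schema` is a hypothetical helper name (not part of Arrow.jl or Tables.jl), and it assumes the partition is a `SubArray` view as produced by `Iterators.partition` over an `AbstractVector` of rows. Any other partition type just falls through to the current behaviour, which is exactly the open question about sources that keep no parent.

   ```julia
   using Tables

   # Hypothetical helper: when the partition is a view, read the schema from its
   # parent, which still carries the full element types (including Missing).
   partition_schema(p::SubArray) = Tables.schema(Tables.columns(parent(p)))

   # Fallback for sources that keep no parent: same as the current behaviour.
   partition_schema(p) = Tables.schema(Tables.columns(p))

   rows = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4))
   for p in Iterators.partition(rows, 2)
       @info "Schema via parent: $(partition_schema(p))"
   end
   ```

   This only covers views; I don't know whether something like it would be acceptable as a general rule for Tables-compatible sources.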
   
   Illustration
   ```
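   # correct when working with Tables object
   using Tables
   df = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4))
   for r in Iterators.partition(Tables.rows(df), 2)
       # `r` is a view of the row table; look up the column type on its parent
       @info "Parent type: $(Tables.getcolumn(Tables.columns(parent(r)), :x1) |> eltype)"
       @info "Columns type: $(Tables.columns(r) |> Tables.schema)"
   end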
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  String
     └  :x2  Int64
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  Union{Missing, String}
     └  :x2  Int64
   
   # incorrect when working with DataFrame
   using DataFrames, Tables
   df = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4)) |> DataFrame
   for r in Iterators.partition(Tables.rows(df), 2)
       # `r` is a view of the DataFrameRows; its parent still knows the full column type
       @info "Parent type: $(r.parent.x1 |> eltype)"
       @info "Columns type: $(Tables.columns(r) |> Tables.schema)"
   end
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  String
     └  :x2  Int64
     [ Info: Parent type: Union{Missing, String}
     ┌ Info: Columns type: Tables.Schema:
     │  :x1  Union{Missing, String}
     └  :x2  Int64
   ```
   
   EDIT: I suspect this will affect other partitioners that rely on iterators over `Tables.rows()`, e.g., `TableOperations.makepartition()`

