Posted to issues@arrow.apache.org by "rdavis120 (via GitHub)" <gi...@apache.org> on 2023/05/16 03:02:49 UTC

[GitHub] [arrow] rdavis120 opened a new issue, #35604: unable to create struct in schema for open_dataset

rdavis120 opened a new issue, #35604:
URL: https://github.com/apache/arrow/issues/35604

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When reading a parquet file and trying to set the schema to use a struct, I only get NA values back.
   
   ```r
   # write parquet 
   arrow::write_dataset(iris, path = 'iris.parquet')
   
   arrow::open_dataset("iris.parquet") |> dplyr::collect()
       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
   1            5.1         3.5          1.4         0.2     setosa
   2            4.9         3.0          1.4         0.2     setosa
   
   
   arrow::open_dataset("iris.parquet", schema = schema(Species=string(),  
                                                       Sepal = struct(Sepal.Length = float64(), Sepal.Width = float64()),  
                                                       Petal = struct(Petal.Length = float64(), Petal.Width = float64()))) |>
     dplyr::collect()
      Species Sepal$Sepal.Length $Sepal.Width Petal$Petal.Length $Petal.Width
      <chr>                <dbl>        <dbl>              <dbl>        <dbl>
    1 setosa                  NA           NA                 NA           NA
    2 setosa                  NA           NA                 NA           NA
    3 setosa                  NA           NA                 NA           NA
   
   > packageVersion('arrow')
   [1] ‘12.0.0’
   ```
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdavis120 closed issue #35604: [R] unable to create struct in schema for open_dataset

Posted by "rdavis120 (via GitHub)" <gi...@apache.org>.

rdavis120 closed issue #35604: [R] unable to create struct in schema for open_dataset
URL: https://github.com/apache/arrow/issues/35604


[GitHub] [arrow] rdavis120 commented on issue #35604: [R] unable to create struct in schema for open_dataset

Posted by "rdavis120 (via GitHub)" <gi...@apache.org>.

rdavis120 commented on issue #35604:
URL: https://github.com/apache/arrow/issues/35604#issuecomment-1553215854

   Ok thanks, I think this works.  
   
   The reason for converting the columns to a struct is that a UDF can support at most three arguments. We can get around this limit by packing a table with more than three columns into a struct and passing that struct to the UDF as a single argument, before the `collect()`.
   
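   ```r
   library(arrow)

   # Register a scalar UDF that takes a single struct argument and returns
   # the row-wise sum of its fields.
   arrow::register_scalar_function(
     "sum2",
     function(context, x) {
       # auto_convert = TRUE converts the Arrow input to an R object before this runs.
       rowSums(x, na.rm = TRUE)
     },
     in_type = schema(x = struct(Sepal.Length = float64(), Sepal.Width = float64(), Petal.Length = float64(), Petal.Width = float64())),
     out_type = float64(),
     auto_convert = TRUE
   )

   # Pack the four numeric columns into one struct column, then call the UDF
   # on it before collect().
   arrow::write_dataset(iris, path = "iris.parquet")
   x <- arrow::open_dataset("iris.parquet")
   x |>
     dplyr::mutate(x = data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), .keep = "none") |>
     dplyr::mutate(sum2 = sum2(x), .keep = "none") |>
     dplyr::collect()
   ```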
   

[GitHub] [arrow] westonpace commented on issue #35604: [R] unable to create struct in schema for open_dataset

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35604:
URL: https://github.com/apache/arrow/issues/35604#issuecomment-1553008818

   The schema in `open_dataset` is not a tool for introducing mutations on the columns.  It can be used to reduce the number of columns you load, or I believe you can include additional columns (to get null values), but you should not be changing the type of existing columns.
   
   I think you will want to use dplyr's mutate and the `make_struct` function (I'm not sure what the R bindings look like for this function).
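   
   As a rough illustration of that suggestion, the sketch below opens the dataset with its stored schema and builds the struct columns inside `mutate()`. It assumes that `data.frame()` inside an arrow dplyr query produces a struct column, as the reporter's follow-up comment in this thread indicates. The temporary directory name `iris_struct_sketch` and the variable `nested` are hypothetical, chosen only for this example.
   
   ```r
   library(arrow)
   library(dplyr)

   # Write the example data to a temporary directory (hypothetical name).
   path <- file.path(tempdir(), "iris_struct_sketch")
   write_dataset(iris, path = path)

   # Open with the stored schema (no schema override), then group the flat
   # columns into struct columns inside the query, before collect().
   nested <- open_dataset(path) |>
     mutate(
       Sepal = data.frame(Sepal.Length, Sepal.Width),
       Petal = data.frame(Petal.Length, Petal.Width)
     ) |>
     select(Species, Sepal, Petal) |>
     collect()
   ```
   
   Because the grouping happens in the query rather than in the `schema` argument, the values come from the existing flat columns instead of from struct columns that are not present in the file.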
