Posted to jira@arrow.apache.org by "Jacob Wujciak-Jens (Jira)" <ji...@apache.org> on 2022/04/08 12:22:00 UTC

[jira] [Updated] (ARROW-16148) [C++] TPC-H generator cleanup

     [ https://issues.apache.org/jira/browse/ARROW-16148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Wujciak-Jens updated ARROW-16148:
---------------------------------------
    Component/s: C++

> [C++] TPC-H generator cleanup
> -----------------------------
>
>                 Key: ARROW-16148
>                 URL: https://issues.apache.org/jira/browse/ARROW-16148
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> An umbrella issue for a number of issues I've run into with our TPC-H generator.
> h2. We emit fixed_size_binary fields with nul-padded strings
> Ideally we would either emit these as utf8 strings like the other string columns, or have a toggle to emit them as such (though see below about needing to strip the nuls).
> When I try to run these through the TPC-H queries, I get a number of segfaults or hangs.
> Additionally, even after converting these to utf8/string types, I also need to strip out the nuls in order to actually query against them:
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
> tab
> #> Table
> #> 25 rows x 4 columns
> #> $N_NATIONKEY <int32>
> #> $N_NAME <fixed_size_binary[25]>
> #> $N_REGIONKEY <int32>
> #> $N_COMMENT <string>
> # This will not work (though this is how the TPC-H queries are structured)
> tab %>% filter(N_NAME == "JAPAN") %>% collect()
> #> # A tibble: 0 × 4
> #> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
> #> #   N_REGIONKEY <int>, N_COMMENT <chr>
> # Instead, we need to create the nul-padded string to do the comparison
> japan_raw <- as.raw(
>   c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
>     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
> )
> # confirming this is the same thing as in the data 
> japan_raw == as.vector(tab$N_NAME)[[13]]
> #>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> #> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> tab %>% filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>% collect()
> #> # A tibble: 1 × 4
> #>   N_NATIONKEY
> #>         <int>
> #> 1          12
> #> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
> #> #   N_COMMENT <chr>
> {code}
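> A less manual way to build the padded comparison value is to construct it from the string itself (a sketch; {{pad_nul}} is a hypothetical helper, not part of arrow):
> {code}
> # Hypothetical helper: right-pad a UTF-8 string with nul bytes to the fixed width
> pad_nul <- function(x, width) {
>   padding <- as.raw(rep(0x00, width - nchar(x, type = "bytes")))
>   Scalar$create(c(charToRaw(x), padding), type = fixed_size_binary(width))
> }
> tab %>% filter(N_NAME == pad_nul("JAPAN", 25)) %>% collect()
> {code}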
> Here is the code I've been using to cast + strip these out after the fact:
> {code}
> library(arrow, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> options(arrow.use_altrep = FALSE)
> tables <- arrowbench:::tpch_tables
>   
> for (table_name in tables) {
>   message("Working on ", table_name)
>   tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), as_data_frame = FALSE)
>   
>   # Cast any fixed_size_binary columns to string; round-tripping through
>   # as.vector() with arrow.skip_nul = TRUE drops the nul padding.
>   for (col in tab$schema$fields) {
>     if (inherits(col$type, "FixedSizeBinary")) {
>       message("Rewriting ", col$name)
>       tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
>     }
>   }
>   
>   write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
> }
> {code}
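> After rewriting the files this way, the plain string comparison works the way the TPC-H queries expect (a quick sanity check, assuming the rewritten files are under ./data/ as written by the loop above):
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> tab <- read_parquet("./data/nation_1.parquet", as_data_frame = FALSE)
> # N_NAME is now a plain string column with no nul padding, so this filter matches
> tab %>% filter(N_NAME == "JAPAN") %>% collect()
> {code}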



--
This message was sent by Atlassian Jira
(v8.20.1#820001)