Posted to issues@arrow.apache.org by "paleolimbot (via GitHub)" <gi...@apache.org> on 2023/03/07 13:27:44 UTC

[GitHub] [arrow] paleolimbot opened a new issue, #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

paleolimbot opened a new issue, #34484:
URL: https://github.com/apache/arrow/issues/34484

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Very possible that there's something wrong with my plan here! Reproducer via the R bindings:
   
   ``` r
   cities <- tibble::tibble(
     city = c("Halifax", "Lancaster", "Chicago"),
     country = c("Canada", "United Kingdom", "United States")
   )
   
   countries <- tibble::tibble(
     country = c("United States", "Canada", "United Kingdom", "Morocco"),
     continent = c("North America", "North America", "Europe", "Africa")
   )
   
   
   tmp_left <- tempfile()
   tmp_right <- tempfile()
   arrow::write_parquet(cities, tmp_left)
   arrow::write_parquet(cities, tmp_right)
   
   plan <- sprintf('{
     "extensionUris": [
       {
         "extensionUriAnchor": 1,
         "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
       },
       {
         "extensionUriAnchor": 2,
         "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
       }
     ],
     "extensions": [
       {
         "extensionFunction": {
           "extensionUriReference": 2,
           "functionAnchor": 3,
           "name": "equal"
         }
       }
     ],
     "relations": [
       {
         "rel": {
           "join": {
             "common": {
               "emit": {
                 "outputMapping": [
                   0,
                   1,
                   2,
                   3
                 ]
               }
             },
             "left": {
               "read": {
                 "baseSchema": {
                   "names": [
                     "city",
                     "country"
                   ],
                   "struct": {
                     "types": [
                       {
                         "string": {
                           "nullability": "NULLABILITY_NULLABLE"
                         }
                       },
                       {
                         "string": {
                           "nullability": "NULLABILITY_NULLABLE"
                         }
                       }
                     ]
                   }
                 },
                 "localFiles": {
                   "items": [
                     {
                       "uriFile": "file://%s",
                       "parquet": {
   
                       }
                     }
                   ]
                 }
               }
             },
             "right": {
               "read": {
                 "baseSchema": {
                   "names": [
                     "country",
                     "continent"
                   ],
                   "struct": {
                     "types": [
                       {
                         "string": {
                           "nullability": "NULLABILITY_NULLABLE"
                         }
                       },
                       {
                         "string": {
                           "nullability": "NULLABILITY_NULLABLE"
                         }
                       }
                     ]
                   }
                 },
                 "localFiles": {
                   "items": [
                     {
                       "uriFile": "file://%s",
                       "parquet": {
   
                       }
                     }
                   ]
                 }
               }
             },
             "expression": {
               "scalarFunction": {
                 "functionReference": 3,
                 "outputType": {
                   "bool": {
                     "nullability": "NULLABILITY_NULLABLE"
                   }
                 },
                 "arguments": [
                   {
                     "value": {
                       "selection": {
                         "directReference": {
                           "structField": {
                             "field": 1
                           }
                         },
                         "rootReference": {
   
                         }
                       }
                     }
                   },
                   {
                     "value": {
                       "selection": {
                         "directReference": {
                           "structField": {
                             "field": 2
                           }
                         },
                         "rootReference": {
   
                         }
                       }
                     }
                   }
                 ]
               }
             },
             "type": "JOIN_TYPE_INNER"
           }
         }
       }
     ]
   }', tmp_left, tmp_right)
   
   arrow:::do_exec_plan_substrait(plan) |> as.data.frame()
   #> # A tibble: 3 × 4
   #>   `FieldPath(0)` `FieldPath(1)` `FieldPath(2)` `FieldPath(3)`
   #>   <chr>          <chr>                   <int>          <int>
   #> 1 Halifax        Canada                      0              0
   #> 2 Lancaster      United Kingdom              0              0
   #> 3 Chicago        United States               0              0
   ```
   
   <sup>Created on 2023-03-07 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   (This is the substrait producer interface I'm using to generate the plan):
   
   ``` r
   # remotes::install_github("voltrondata/substrait-r#225")
   library(substrait, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   cities <- tibble::tibble(
     city = c("Halifax", "Lancaster", "Chicago"),
     country = c("Canada", "United Kingdom", "United States")
   )
   
   countries <- tibble::tibble(
     country = c("United States", "Canada", "United Kingdom", "Morocco"),
     continent = c("North America", "North America", "Europe", "Africa")
   )
   
   compiler <- cities |> 
     arrow_substrait_compiler() |> 
     substrait_join(countries)
   
   # Obviously very wrong
   compiler |> collect()
   #> # A tibble: 3 × 4
   #>   city      country.x      country.y continent
   #>   <chr>     <chr>              <int>     <int>
   #> 1 Halifax   Canada                 0         0
   #> 2 Lancaster United Kingdom         0         0
   #> 3 Chicago   United States          0         0
   
   # Anything wrong here?
   compiler$plan()
   #> message of type 'substrait.Plan' with 3 fields set
   #> extension_uris {
   #>   extension_uri_anchor: 1
   #>   uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
   #> }
   #> extension_uris {
   #>   extension_uri_anchor: 2
   #>   uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
   #> }
   #> extensions {
   #>   extension_function {
   #>     extension_uri_reference: 2
   #>     function_anchor: 3
   #>     name: "equal"
   #>   }
   #> }
   #> relations {
   #>   root {
   #>     input {
   #>       join {
   #>         common {
   #>           emit {
   #>             output_mapping: 0
   #>             output_mapping: 1
   #>             output_mapping: 2
   #>             output_mapping: 3
   #>           }
   #>         }
   #>         left {
   #>           read {
   #>             base_schema {
   #>               names: "city"
   #>               names: "country"
   #>               struct {
   #>                 types {
   #>                   string {
   #>                     nullability: NULLABILITY_NULLABLE
   #>                   }
   #>                 }
   #>                 types {
   #>                   string {
   #>                     nullability: NULLABILITY_NULLABLE
   #>                   }
   #>                 }
   #>               }
   #>             }
   #>             named_table {
   #>               names: "named_table_1"
   #>             }
   #>           }
   #>         }
   #>         right {
   #>           read {
   #>             base_schema {
   #>               names: "country"
   #>               names: "continent"
   #>               struct {
   #>                 types {
   #>                   string {
   #>                     nullability: NULLABILITY_NULLABLE
   #>                   }
   #>                 }
   #>                 types {
   #>                   string {
   #>                     nullability: NULLABILITY_NULLABLE
   #>                   }
   #>                 }
   #>               }
   #>             }
   #>             named_table {
   #>               names: "named_table_2"
   #>             }
   #>           }
   #>         }
   #>         expression {
   #>           scalar_function {
   #>             function_reference: 3
   #>             output_type {
   #>               bool {
   #>                 nullability: NULLABILITY_NULLABLE
   #>               }
   #>             }
   #>             arguments {
   #>               value {
   #>                 selection {
   #>                   direct_reference {
   #>                     struct_field {
   #>                       field: 1
   #>                     }
   #>                   }
   #>                   root_reference {
   #>                   }
   #>                 }
   #>               }
   #>             }
   #>             arguments {
   #>               value {
   #>                 selection {
   #>                   direct_reference {
   #>                     struct_field {
   #>                       field: 2
   #>                     }
   #>                   }
   #>                   root_reference {
   #>                   }
   #>                 }
   #>               }
   #>             }
   #>           }
   #>         }
   #>         type: JOIN_TYPE_INNER
   #>       }
   #>     }
   #>     names: "city"
   #>     names: "country.x"
   #>     names: "country.y"
   #>     names: "continent"
   #>   }
   #> }
   ```
   
   <sup>Created on 2023-03-07 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on issue #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on issue #34484:
URL: https://github.com/apache/arrow/issues/34484#issuecomment-1459077623

   take




[GitHub] [arrow] paleolimbot commented on issue #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34484:
URL: https://github.com/apache/arrow/issues/34484#issuecomment-1460133619

   >  It seems we are doing way too much testing with in-memory tables 
   
   The flip side of that is that we really should implement a proper TableProvider instead of writing temporary Parquet files! Glad that it at least exposed this before it was used somewhere else 😬 




[GitHub] [arrow] westonpace commented on issue #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34484:
URL: https://github.com/apache/arrow/issues/34484#issuecomment-1458922632

   Your plan is good.  It seems we are doing way too much testing with in-memory tables :cold_sweat: 
   
   Since you are actually scanning files, the plan runs through a scan node. The scan node is obnoxiously (and incorrectly) appending the augmented fields (`__fragment_index`, `__batch_index`, `__last_in_fragment`, `__filename`) to its output.
   
   This is a bug in Acero's Substrait handling for scans.
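
A rough illustration of why extra scan fields would produce integer zeroes in the output, written in base R with no Arrow calls. It assumes (this is not confirmed in the thread) that the join resolves field references against the left fields followed by the right fields, and that the augmented fields are appended in the order listed above. The variable names are hypothetical.

``` r
# Field names the plan declares for each side of the join (its baseSchema).
left_declared  <- c("city", "country")
right_declared <- c("country", "continent")

# Augmented fields named in the comment above, in the order listed there.
augmented <- c("__fragment_index", "__batch_index",
               "__last_in_fragment", "__filename")

# Assumption: the join sees left fields followed by right fields.
expected_schema <- c(left_declared, right_declared)
actual_schema   <- c(left_declared, augmented, right_declared, augmented)

# The plan's emit outputMapping. Substrait field indices are 0-based,
# R vectors are 1-based, hence the + 1 below.
emit <- 0:3
resolved_expected <- expected_schema[emit + 1]
resolved_actual   <- actual_schema[emit + 1]

print(data.frame(
  index    = emit,
  expected = resolved_expected,
  actual   = resolved_actual
))
```

Under these assumptions, indices 2 and 3 land on `__fragment_index` and `__batch_index` rather than on the right-hand `country` and `continent`, which would be consistent with the two `<int>` columns in the report.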




[GitHub] [arrow] westonpace commented on issue #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34484:
URL: https://github.com/apache/arrow/issues/34484#issuecomment-1458926783

   CC @vibhatha if you have time to take a look.  For now we should just hide the augmented fields entirely from Substrait.  At the moment I'm thinking the simplest way to do this might be to place a project node after any scan node.  The project node can then hide these fields.
   
   Then we can wait to fix it properly until we have the new scan node.
   
   I think the "proper" fix would be to key on the ReadRel's `baseSchema` property.  If a column is named `__filename` (or any other augmented field name) then we automatically assume they are asking for the augmented field and we deliver it.  This means a user can't use a column named `__filename` but that seems like a reasonable workaround until we decide to introduce the concept of "augmented fields" to Substrait (I can't imagine this happening anytime soon).
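
The two options above can be sketched on a plain data frame in base R. This is only a model of the idea, not Acero code: `scan_output`, `hide_augmented()`, `select_by_base_schema()`, and the file name `cities.parquet` are all hypothetical names introduced for illustration.

``` r
# Augmented field names as listed earlier in the thread.
augmented <- c("__fragment_index", "__batch_index",
               "__last_in_fragment", "__filename")

# Stand-in for what a scan emits: the file's columns plus augmented fields.
# check.names = FALSE keeps the leading underscores in the column names.
scan_output <- data.frame(
  city = c("Halifax", "Lancaster", "Chicago"),
  country = c("Canada", "United Kingdom", "United States"),
  `__fragment_index` = 0L,
  `__batch_index` = 0L,
  `__last_in_fragment` = TRUE,
  `__filename` = "cities.parquet",  # hypothetical file name
  check.names = FALSE
)

# Interim workaround: a projection placed after the scan that drops
# every augmented field unconditionally.
hide_augmented <- function(batch) {
  batch[, setdiff(names(batch), augmented), drop = FALSE]
}

# "Proper" fix: emit exactly the columns named in baseSchema, so an
# augmented field is delivered only when the plan asks for it by name.
select_by_base_schema <- function(batch, base_schema_names) {
  batch[, base_schema_names, drop = FALSE]
}

names(hide_augmented(scan_output))
names(select_by_base_schema(scan_output, c("city", "country", "__filename")))
```

In this model the first call leaves only `city` and `country`, while the second also returns `__filename` because the caller named it. The trade-off described above shows up directly: a real data column called `__filename` could not be told apart from the augmented one.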




[GitHub] [arrow] vibhatha commented on issue #34484: [C++] Substrait join results in all zeroes on the righthand side of the join

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on issue #34484:
URL: https://github.com/apache/arrow/issues/34484#issuecomment-1459077943

   > CC @vibhatha if you have time to take a look. For now we should just hide the augmented fields entirely from Substrait. At the moment I'm thinking the simplest way to do this might be to place a project node after any scan node. The project node can then hide these fields.
   > 
   > Then we can wait to fix it properly until we have the new scan node.
   > 
   > I think a more "proper" fix would be to key on the ReadRel's `baseSchema` property. If the column is named `__filename` (or whatever) then we automatically assume they are asking for the augmented field and we deliver it. This means a user can't use a column named `__filename` but that seems like a reasonable workaround until we decide to introduce the concept of "augmented fields" to Substrait (I can't imagine this happening anytime soon).
   
   Sure, @westonpace I will work on this.

