You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Jin Shang (Jira)" <ji...@apache.org> on 2022/09/09 17:47:00 UTC

[jira] [Created] (ARROW-17668) [C++][Gandiva]Add parser frontend for Gandiva

Jin Shang created ARROW-17668:
---------------------------------

             Summary: [C++][Gandiva]Add parser frontend for Gandiva
                 Key: ARROW-17668
                 URL: https://issues.apache.org/jira/browse/ARROW-17668
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++ - Gandiva
            Reporter: Jin Shang
            Assignee: Jin Shang


This issue is a feature proposal to add a parser frontend for Gandiva.
h2. Background

My team uses an expression computation library for our C++ feature engineering pipeline. We currently use [Exprtk]([https://github.com/ArashPartow/exprtk]). We recently tried out Gandiva and wrote some benchmarks. We discovered that Gandiva is several times faster than Exprtk in our use cases. Therefore we decided to switch to Gandiva for computing expressions. 
h2. Objective

As of current, due to its lack of a frontend, we need to manually construct an AST to use Gandiva. This is inconvenient and requires extra learning cost. We would like to implement a parser frontend for Gandiva, so that Gandiva becomes a standalone complete expression compiler and evaluator, and a drop-in replacement for the existing libraries like like [Exprtk]([https://github.com/ArashPartow/exprtk]) and [TinyExpr]([https://github.com/codeplea/tinyexpr]). The goal is to enable the following functionality:

 
{code:cpp}
// Create schema for gandiva
auto field_x = arrow::field(x, arrow::int32());
auto field_y = arrow::field(y, arrow::float64());
auto field_z = arrow::field(z, arrow::boolean());
auto schema = arrow::scehma({field_x, field_y, field_z});

// Use the Parser to generate an ExpressionPtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser();
std::shared_ptr<gandiva::ExpressionPtr> expr_ptr;
auto status = parser.Parse(schema, expr_str, &expr_ptr);

// The rest is normal usage of Gandiva projector
std::shared_ptr<gandiva::Projector> projector;
auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto status = projector->Evaluate(*in_batch, pool, &outputs);
{code}
 
h2. Syntax

The goal is to design a succinct and intuitive grammar for both schema and Gandiva expressions. We will need a corresponding grammar for each Node type in Gandiva.
 - Literals: We find Rust’s literal representation([https://doc.rust-lang.org/rust-by-example/types/literals.html]([https://doc.rust-lang.org/rust-by-example/types/literals.html])) very intuitive. We’ll support suffix such as `i32`, `u64`, `f32` to denote a literal’s type. The types of unsuffixed literals are inferred by their usage. Otherwise, the default type for integers is `int32` and `float32` for floating points. String and binary literals are wrapped with single or double quotes. Decimal128 literals will not be supported in the first version.
 - Fields: Just their names as defined in the schema. To avoid conflicts with other node types, field names must start with alphabetical letters.
 - Functions: `<function_name>(<param1>, <param2>, ...)`. For functions with multiple overloads, their return type is inferred from input types. For commonly used functions, we would also like to support infix forms. They include:
    - Comparisons: equal(==), not equal(!=), greater than(>), greater than or equal to(>=), less than(<), less than or equal to(<=)
    - Arithmetics: add( +), substract( -), multiply( *), divide( /), modulo(%), power({^}), bitwise and(&), bitwise or(|), bitwise xor({^}), bitwise not(~)
    
    Function alias with spaces in their names won’t be supported such as “is not false” are not supported.
    
 - If: We could like to support two grammars for if expressions:
    - `if(<cond>, <then>, <else>)` for its simplicity and functional feel;
    - `if(<cond>) \{ <then> } else \{ <else> }` since it’s the C++ `if` grammar and has better formatting for complex expressions.
 - Boolean: We would like to support both `&& ||` and `and or` keywords same as C++.
 - InExpression: `<eval> in (<member1>, <member2>, ...)` . Its type is also inferred.

The grammar can be roughly represented as:
{code:cpp}
exp := literal | field | function | if | boolean | inexpression
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL | FLOAT_LITERAL_SUFFIX | "-" literal
field := IDENTIFIER
function := IDENTIFIER(arguments) | infixfunc
infixfunc := exp "+" exp | exp "-" exp | ...
arguments := exp | argument "," exp
if := IF "(" exp "," exp "," exp ")" | "if" "(" exp ")" "{" exp "}" ELSE "{" exp "}"
boolean := and_boolean | or_boolean
and_boolean := exp AND exp | exp "&&" exp
or_boolean := exp OR exp | exp "||" exp
inexpression := exp IN "(" arguments ")"{code}
lower cases are non-terminals and upper cases are tokens.
h2. Implementation
h3. Lexing and Parsing

We would like to use flex and bison for lexing and parsing. They have several advantages compared to other options like Antlr4 and Boost::Spirit.
 # They are the most classical and popular parsing library in the cpp world.
 # They allow us to precompile the parser codes and have no runtime dependencies.

Flex&bison takes a string and outputs a `gandiva::node` tree. The tree may contain untyped nodes, e.g., unsuffixed literals and functions. 
h3. Type Inference

We’ll have a TypeInference class that inherits node visitor, implementing a simple DFS algorithm to do type inference. 

For functions, 
 # Get all available signatures of this function from the function registry.
 # Find all candidate signatures that match the current known argument and return type.
 ## If none of them matches, there is a type error.
 ## If only one matches, this is the signature. From the signature we also know the type of each argument. Visit each argument node carrying their return type.
 ## If more than one matches, visit each argument node first. Then we do matching again. It is guaranteed to have zero or one matches because no function has overloads that differ only on their return types, because at this point each argument’s type is known (because untyped literals are given default types upon visit, see below). 

The process is similar for `if`, `boolean`and `in expression` , only their “signatures” are manually constructed. For example `if` has `(bool, <type>, <type>)-><type>`and `boolean` has `(bool, bool)->(bool)`.

For untyped literals,
 # If a type is passed from its parent, the type is known.
 # Otherwise set it to the default type.

Any feedback is appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)