You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "gortiz (via GitHub)" <gi...@apache.org> on 2024/02/21 10:07:43 UTC

[I] String to bytes autocasting [pinot]

gortiz opened a new issue, #12457:
URL: https://github.com/apache/pinot/issues/12457

   # The problem
   
   It is not well defined and not well implemented how to cast Strings to bytes. Sometimes we cast them as using `hexToBinary` function. Sometimes we fail. And in future changes in Calcite we may also sometimes cast them following the Postgres not documented (and not recommended) semantics.
   
   There are some PRs that change the behavior, but I think it is important to first understand the problem in deep and decide what semantics we want to follow and how that is going to affect the backward compatibility between V1 and V2 (aka multistage engine).
   
   Note: The text of this issue is very long. In order to make it easier to read I used detail sections like the following:
   <details>
   <summary>Remember you can click on rows like this</summary>
   To open the details that would normally make very difficult to read the whole text
   </details>
   
   # Solutions
   ## To the casting issues specific to V2
   1. Keep V1 compatibility and cast Strings to Bytes using `hexToBinary`. 
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
   `select * from starbucksStores where location_st_point = '80c062bc98021f94f1404e9bda0f6b0202'` would be equivalent to 
   </li>
   <li>
   `select * from starbucksStores where location_st_point = hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = bytesToHex(location_st_point)`  would be equivalent to `select * from starbucksStores where location_st_point = hexToBytes(bytesToHex(location_st_point))`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = address` would be equivalent to `select * from starbucksStores where location_st_point = hexToBytes(address)`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates this is a byte literal in `hex` notation.
   </li>
   </ul>
    </details>
   
   2. Keep _standard_ compatibility and behave like Postgres. As far as we know there is no standard definition here, but both Postgres and MySQL seem to behave similarly.
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
    `select * from starbucksStores where location_st_point = '80c062bc98021f94f1404e9bda0f6b0202'` would be equivalent to 
   </li>
   <li>
   `select * from starbucksStores where location_st_point = asciiToBytes('80c062bc98021f94f1404e9bda0f6b0202')`. It will return different values than in V1. In this case it will return an empty set.
   </li>
   <li>
   `select * from starbucksStores where location_st_point = bytesToHex(location_st_point)`  would be equivalent to `select * from starbucksStores where location_st_point = asciiToBytes(bytesToHex(location_st_point))`. It will return different values than in V1. It will always return an empty set.
   'select * from starbucksStores where location_st_point = address` would be equivalent to `select * from starbucksStores where location_st_point = asciiToBytes(address)`. It will return different values than in V1.
   </li>
   <li>
   `select * from starbucksStores where location_st_point = x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates this is a byte literal in `hex` notation.
   </li>
   </ul>
    </details>
   
   3. Consider this cast illegal. In order to compare bytes with Strings customers would need to specify the conversion function. 
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
      `select * from starbucksStores where location_st_point = '80c062bc98021f94f1404e9bda0f6b0202'` would be illegal. Bytes and Strings cannot be compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = bytesToHex(location_st_point)` would be illegal. Bytes and Strings cannot be compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = address` would be illegal. Bytes and Strings. Bytes and Strings cannot be compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = hexToBytes(address)` would be legal. Here we are specifying how to convert the String to bytes.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')` would be legal. Here we are specifying how to convert the String to bytes.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates this is a byte literal in `hex` notation.
    </li>
   </ul>
    </details>
   
   ## To the casting issues specific to V1
   Probably keep them as they are right now.
   
   ## To the casting issues specific to leaf stage
   1. Eventually modify leaf stage to do not relay on V1 given the know issues we have there.
   
   # Example
   
   Using GenericQuickstart:
   * The expression 
   ```select * from starbucksStores where location_st_point = '80c062bc98021f94f1404e9bda0f6b0202' limit 10``` is valid using the V1 but not in V2. 
   * The expression
   ```select * from starbucksStores where location_st_point = X'80c062bc98021f94f1404e9bda0f6b0202' limit 10``` parses in both V1 and V2, but in V1 returns 0 rows while in V2 returns the row whose bytes are equal to `hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')`.
   * The expression ```select * from starbucksStores where location_st_point = bytesToHex(location_st_point) limit 10``` fails in V1 with
   ```java.lang.IllegalArgumentException at org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readValuesSV(ForwardIndexReader.java:322)```
   while in V2 fails with:
   ```
   Error Code: 200
   
   QueryExecutionError:
   Received error query execution result block: {200=QueryExecutionError:
   org.apache.pinot.spi.exception.BadQueryRequestException: Caught exception while initializing transform function: cast
   	at org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:332)
   	at org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:327)
   	at org.apache.pinot.core.operator.filter.ExpressionFilterOperator.<init>(ExpressionFilterOperator.java:69)
   	at org.apache.pinot.core.plan.FilterPlanNode.constructPhysicalOperator(FilterPlanNode.java:246)
   ...
   Caused by: java.lang.IllegalArgumentException: Unable to cast expression to type - BYTES
   	at org.apache.pinot.core.operator.transform.function.CastTransformFunction.init(CastTransformFunction.java:112)
   	at org.apache.pinot.core.operator.transform.function.BaseTransformFunction.init(BaseTransformFunction.java:120)
   ```
   * The expression ```select location_st_point = '80c062bc98021f94f1404e9bda0f6b0202' from starbucksStores limit 10``` fail in both V1 and V2.
   * The expression ```select * from starbucksStores where location_st_point = CAST(address as BYTES) limit 10```` fails in V1 saying Strings cannot be cast to Bytes. In V2 the equivalent expression is ```select * from starbucksStores where location_st_point = CAST(address as BINARY) limit 10```, which fails in the same way.
   
   # How other databases convert Strings to Bytes
   
   Postgres defines different ways to declare `bytea` (their bytes type) literals ([link](https://www.postgresql.org/docs/current/datatype-binary.html)). 
   * The `hex` format is the closest to what it seems we try to do. In order to use that, in Postgres you declare a String literal starting with `\x` followed by a even number of hex digits.
   * The `escape` format, which is a bit more complex and not recommended.
   
   Although it is not documented (or at least neither @Jackie-Jiang or I found a reference in the documentation) is that strings that do not match with the previous formats are automatically converted to `bytea` as the UTF8 representation of the String. 
   
   <details>
     <summary>Examples</summary>
   
   ```postgres
   postgres=# create table asd(id int, bytes bytea);
   CREATE TABLE
   postgres=# insert into asd(id, bytes) VALUES (1, '\x1234');
   INSERT 0 1
   postgres=# select * from asd;
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# select * from asd where bytes = '\x1234';
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# select * from asd where bytes = '\x1234';
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# explain select * from asd where bytes = '\x1234';
                        QUERY PLAN                      
   -----------------------------------------------------
    Seq Scan on asd  (cost=0.00..25.88 rows=6 width=36)
      Filter: (bytes = '\x1234'::bytea)
   (2 rows)
   
   postgres=# explain select * from asd where bytes = '1234';
                        QUERY PLAN                      
   -----------------------------------------------------
    Seq Scan on asd  (cost=0.00..25.88 rows=6 width=36)
      Filter: (bytes = '\x31323334'::bytea)
   (2 rows)
   ```
   
   The last two sentences (using explain) show how the literal is interpreted. Postgres uses the `hex` format to represent back the binaries and we can see that `'1234'` was transformed into `'\x31323334'`.
   </details>
   
   # What Calcite does
   Calcite recognizes different literals as well. For example `x'1234'` is interpreted as Postgres `'\x1234'. In the current version values like `80c062bc98021f94f1404e9bda0f6b0202` produce a failure, but https://github.com/apache/calcite/pull/3635 will change the behavior to behave as Postgres in this case as well.
   
   # What Pinot does in V2
   In V2 Pinot delegates validation to Calcite, so expressions all expressions like `location_st_point = <something of type string>` are validated into something like `location_st_point = CAST(<something of type string> as BINARY)`. The tricky thing here is that the behavior depends on whether `<something of type string>` is simplifiable or not.
   
   Having different semantics depending on whether we compare with a literal or a column is unexpected, difficult to understand and error prone and therefore not acceptable. This is one of the problems V1 has that we try to fix in V2. There is PR (https://github.com/apache/pinot/pull/12401) that modified V2 to always use `hexToBinary` in this case, but we may decide to always cast using Postgres semantic instead. This PR doesn't solve all the issues with this conversion. It just removes the dependency on whether the expression is simplifiable or not, but then the expression is sent to V1 and most problems in V1 still affects V2.
   
   <details>
     <summary>Details explaining why the behavior depends on whether the String is simplifiable or not</summary>
   
   ## When String expression is not simplifiable
   For example, the expression `where location_st_point = binaryToHex(location_st_point)` is translated to `filter location_st_point = CAST(bytesToHex(location_st_point) as BINARY)`. Here the simplifier cannot simplify `CAST(bytesToHex(location_st_point) as BINARY)` because it depends on a column (`location_st_point`). 
   
   Then V2 transforms that expression to something the leaf stage (which is basically V1) can understand. Specifically, `BINARY` is not a valid V1 type, so it is transformed to Bytes and the expression executed in the leaf stage is `CAST(bytesToHex(location_st_point) as BYTES)`. 
   
   The problem comes when the leaf stage evaluates the expression. Here we receive the most funny error saying that the query fails because String cannot be cast to bytes. This is in fact correct. As shown in the examples above, query ```select * from starbucksStores where location_st_point = CAST(address as BYTES) limit 10``` fails in the same way in V1.
   
   With [the patch](https://github.com/apache/pinot/pull/12401) this expression doesn't fail because the leaf stage doesn't actually sees the `CAST` function but `hexToBinary` instead.
   
   ## When String expression is simplifiable
   For example, `location_st_point = '80c062bc98021f94f1404e9bda0f6b0202'`. Then it is translated to `location_st_point = CAST('80c062bc98021f94f1404e9bda0f6b0202' as BINARY). Given we ask Calcite to simplify that and that CAST is a pure function with a single argument which is a literal, then Calcite simplifies the expression to the actual result of the casting. As explained above, right now that produces a failure in Calcite, but once (and if) https://github.com/apache/calcite/pull/3635 is merged, the expression will be transformed to `location_st_point = <postgres interpretation of '80c062bc98021f94f1404e9bda0f6b0202'>`.
   
   </details>
   
   # What Pinot does in V1
   
   Given the relax typing nature of V1, there are no implicit casts. Instead we have operations that may or may not be legal depending on contextual factors.
   
   ## Compare bytes column with String literal
   <details>
   <summary>As a where predicate</summary>
   For example, if we compare a bytes column with a String literal in a where clause, V1 will decide to transform the String literal into bytes in a way that is similar to `hexToBinary` function. Therefore 
   ```select * from starbucksStores where location_st_point = '80c062bc98021f94f1404e9bda0f6b0202' limit 10```
   Is equivalent to
   ```select * from starbucksStores where location_st_point = hexToBytes('80c062bc98021f94f1404e9bda0f6b0202') limit 10```
   
   This works because we end up doing a full scan in which the literal in the right side of the expression is transformed with `hexToBytes`.
   </details>
   
   <details>
   <summary>As a select value</summary>
   Contrary to the where case, here we enforce a type check. Specifically, in a query like ```select location_st_point = address from starbucksStores limit 10``` we fail in `BinaryOperatorTransformationFunction.init` because left type is BYTES but right type is not:
   ```java
      if (_leftStoredType == DataType.BYTES || _rightStoredType == DataType.BYTES) {
         Preconditions.checkState(_leftStoredType == _rightStoredType, String.format(
             "Unsupported data type for comparison: [Left Transform Function [%s] result type is [%s], Right Transform "
                 + "Function [%s] result type is [%s]]", _leftTransformFunction.getName(), _leftStoredType,
             _rightTransformFunction.getName(), _rightStoredType));
       }
   ```
   
   There is a plot twist here. Although we should fail with `Unsupported data type for comparision.....` in fact we fail with 
   ```
   Caused by: java.lang.UnsupportedOperationException
   	at org.apache.pinot.core.operator.transform.function.IdentifierTransformFunction.getName(IdentifierTransformFunction.java:56)
   ```
   because `IdentifierTransformFunction.getName` is not override, so the error show to the user is quite bad.
   
   In order to view the correct error we can trick the expression as something like: ```select hexToBytes(bytesToHex(location_st_point)) = substr(location_st_point, 1, -1) from starbucksStores limit 10``` which fails with the _expected_:
   ```
   Caused by: java.lang.IllegalStateException: Unsupported data type for comparison: [Left Transform Function [hexToBytes] result type is [BYTES], Right Transform Function [substr] result type is [STRING]]
   	at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
   ```
   </details>
   ## Compare bytes column with a not literal String
   <details>
   <summary>As where predicate</summary>
   For example `select * from starbucksStores where location_st_point = address limit 10` or `select * from starbucksStores where location_st_point = bytesToHex(location_st_point) limit 10`
   
   In this case V1 fails in a very strange way. Specifically, query ```select * from starbucksStores where location_st_point = address limit 10``` fails with:
   
   ```
   java.lang.IllegalArgumentException
   	at org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readValuesSV(ForwardIndexReader.java:322)
   	at org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkForwardIndexReader.readValuesSV(BaseChunkForwardIndexReader.java:444)
   	at org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkForwardIndexReader.readValuesSV(BaseChunkForwardIndexReader.java:43)
   	at org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readDoubleValues(DataFetcher.java:537)
   ```
   
   The reason is that V1 does not add implicit casts, so when it tries to execute `where location_st_point = address` it sees a comparison between different types and applies the function that compares doubles. Therefore it tries to convert `location_st_point` to double and fails because `ForwardIndexReader.readValuesSV(int[] docIds, int length, double[] values, T context)` doesn't know how to do that.
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.
gortiz commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1959788564

   I've created this draft PR where I implemented a way to fail when casting anything that is not binary into binary: https://github.com/apache/pinot/pull/12475


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.
gortiz commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1957737138

   > Throw exception in V2 when encountering implicit string to binary conversion with proper error message to guide the user to use explicit binary
   
   Two things here:
   1. Only implicit casts? What about explicit casts?
   2. With implicit cast can be do even better. We can tell Calcite that we do not support that type coersion from String to Bytes and it wont type at validation time. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1958845665

   > 1. Only implicit casts? What about explicit casts?
   
   I guess we can throw exception in both cases, and only support conversion with encoding specified (e.g. `hexToBytes()`, or introduce the same functions as PostgreSQL)
   
   > 2. With implicit cast can be do even better. We can tell Calcite that we do not support that type coersion from String to Bytes and it wont type at validation time.
   
   Sounds good, as long as we can provide proper exception message to user and guide them to directly use explicit bytes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1957637578

   Thanks for the comprehensive explanation of the problem!
   
   - V1 is not strong typed, and for historic reason Pinot used to only take string as literal. That is the reason why we use hex encoded string to represent binary, and does implicit cast with hex encoding.
   - On the contrast, V2 is strong typed, so in order to compare with a BYTES column, literal needs to be binary type. The most straight forward way is to directly declare the literal as binary (e.g. `X'80c062bc98021f94f1404e9bda0f6b0202 '`). When `'80c062bc98021f94f1404e9bda0f6b0202'` is used, it is initially parsed as unknown type, thus Calcite will try to `CAST` it to binary to match the LHS. If RHS is explicit string type (e.g. `CAST('80c062bc98021f94f1404e9bda0f6b0202' AS varchar)`), it should fail because binary and varchar are not compatible.
   - To reduce the confusion, I'm suggesting doing the following:
     - Keep the V1 implicit hex encoding as is, but make it accept explicit binary (e.g. `X'80c062bc98021f94f1404e9bda0f6b0202 '`)
     - Throw exception in V2 when encountering implicit string to binary conversion with proper error message to guide the user to use explicit binary
   
   For the non-literal comparison, it is actually a different issue. In V1, `colA = colB` is rewritten to `colA - colB = 0`, where the `minus` operation only accepts numeric types. It is a more complicated problem, and IMO out of the scope of this issue. We can open a new issue to discuss this problem separately, and focus on literal handling within this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang closed issue #12457: String to bytes autocasting
URL: https://github.com/apache/pinot/issues/12457


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.
gortiz commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1972834050

   As a partial solution, @Jackie-Jiang and I are planning to merge https://github.com/apache/pinot/pull/12475, which forbids implicit or explicit casts from anything to Bytes/BINARY only in V2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] String to bytes autocasting [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.
gortiz commented on issue #12457:
URL: https://github.com/apache/pinot/issues/12457#issuecomment-1959791294

   I've tried to use type coersion, but it doesn't seem to work as I expected. I've asked about that in Apache Calcite email list (see https://lists.apache.org/thread/6tsom4d962qd6zco4t922f1fsldpgnm1)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org