You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/23 20:38:27 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #935: Add support for TRIM BOTH/LEADING/TRAILING

alamb opened a new issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The more standard SQL datafusion supports, the easier it will be to integrate it into the rest of the analytics ecosystem
   
   **Describe the solution you'd like**
   I would like to be able to run the following queries without error:
   
   ```shell
   cargo run --bin datafusion-cli
   ```
   
   ```sql
   -- works
   SELECT TRIM('  bar   ');
   
   -- does not work (examples from mysql)
   SELECT TRIM(LEADING 'x' FROM 'xxxbarxxx');
   SELECT TRIM(BOTH 'x' FROM 'xxxbarxxx');
   SELECT TRIM(TRAILING 'xyz' FROM 'barxxyz');
   ```
   
   Actual output:
   ```sql
   > SELECT TRIM('  bar   ');
   
   +------------------------+
   | trim(Utf8("  bar   ")) |
   +------------------------+
   | bar                    |
   +------------------------+
   1 row in set. Query took 0.003 seconds.
   
   
   > SELECT TRIM(LEADING 'x' FROM 'xxxbarxxx');
   Plan("TRIM LEADING is not yet supported ")
   
   > SELECT TRIM(BOTH 'x' FROM 'xxxbarxxx');
   Plan("TRIM BOTH is not yet supported ")
   
   > SELECT TRIM(TRAILING 'xyz' FROM 'barxxyz');
   Plan("TRIM TRAILING is not yet supported ")
   ```
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   SQL (postgres and mysql) offer support for extended `TRIM` syntax, and with https://github.com/apache/arrow-datafusion/pull/934 DataFusion's SQL parser can parse this syntax.
   
   Here are links to the references:
   * https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_trim
   * https://www.postgresql.org/docs/current/functions-string.html
   
   Here is a copy of the syntax from the mysql manual:
   
   ```
   TRIM([{BOTH | LEADING | TRAILING} [remstr] FROM] str), TRIM([remstr FROM] str)
   
   Returns the string str with all remstr prefixes or suffixes removed. If none of the specifiers BOTH, LEADING, or TRAILING is given, BOTH is assumed. remstr is optional and, if not specified, spaces are removed.
   
   mysql> SELECT TRIM('  bar   ');
           -> 'bar'
   mysql> SELECT TRIM(LEADING 'x' FROM 'xxxbarxxx');
           -> 'barxxx'
   mysql> SELECT TRIM(BOTH 'x' FROM 'xxxbarxxx');
           -> 'bar'
   mysql> SELECT TRIM(TRAILING 'xyz' FROM 'barxxyz');
           -> 'barx'
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-904126111


   I believe this should be very easy now the parser supports it and just needs a tiny bit of plumbing and a few tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905864071


   I had a look at the postgres docs again and this implementation meets their definition (https://www.postgresql.org/docs/current/functions-string.html#id-1.5.8.10.7.2.2.2.1.1.1) importantly: 
   
   ```removes the longest string containing only characters in characters``` 
   
   which means characters is treated as an character array not a string literal
   
   Both postgres and datafusion implement consistently:
   
   ```
   SELECT BTRIM('leadtrailtextleadtrail','leadtrail')
   'x'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adsharma commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
adsharma commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905730026


   Does this look right?
   
   https://github.com/adsharma/arrow-datafusion/commit/8c4bd15725d769f6e78f8218fe27c8e935316b93
   
   After this code, I see a query plan that looks legit, but the output is not what I'd expect. Any hints on what's missing or how to debug it?
   
   ```
   > SELECT TRIM(LEADING 'x' FROM 'xxxbarxxx');
   +------------------------------------+
   | ltrim(Utf8("x"),Utf8("xxxbarxxx")) |
   +------------------------------------+
   |                                    |
   +------------------------------------+
   1 row in set. Query took 0.007 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905778857


   @adsharma  perhaps the arguments to ltrim are reversed? Maybe something like `ltrim(Utf8("xxxbarxxx"), Utf8("x"))` would produce the expected answer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905832359


   > The problem boils down to the interpretation: is it a set of characters or a string to be trimmed?
   
   I think postgres `btrim` (what DataFusion is modeled on)  is supposed to be removing the substrings too (not a set of characters) 🤔  We will need to do some more research perhaps


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905833583


   @alamb let me have a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
alamb closed issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-906327322


   I ran the same query in MYSQL and POSTGRES and they do produce different results (!!): 
   
   mysql
   ```
   mysql> SELECT TRIM(BOTH 'leadtrail' FROM 'leadtrailtextleadtrail');
   +------------------------------------------------------+
   | TRIM(BOTH 'leadtrail' FROM 'leadtrailtextleadtrail') |
   +------------------------------------------------------+
   | text                                                 |
   +------------------------------------------------------+
   1 row in set (0.00 sec)
   ```
   
   postgres
   ```
   alamb=# SELECT TRIM(BOTH 'leadtrail' FROM 'leadtrailtextleadtrail');
    btrim 
   -------
    x
   (1 row)
   ```
   
   So as @seddonm1  points out DF is consistent with postgres which is the current target


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-904146002


   > I believe this should be very easy now the parser supports it and just needs a tiny bit of plumbing and a few tests.
   
   Indeed -- a perfect good first issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adsharma commented on issue #935: Add support for TRIM BOTH/LEADING/TRAILING

Posted by GitBox <gi...@apache.org>.
adsharma commented on issue #935:
URL: https://github.com/apache/arrow-datafusion/issues/935#issuecomment-905826481


   Thanks! That was one problem. Fixed now. 
   
   But there is a second significant problem:
   
   ```
   > SELECT TRIM(BOTH 'leadtrail' FROM 'leadtrailtextleadtrail');
   +---------------------------------------------------------+
   | btrim(Utf8("leadtrailtextleadtrail"),Utf8("leadtrail")) |
   +---------------------------------------------------------+
   | x                                                       |
   +---------------------------------------------------------+
   1 row in set. Query took 0.008 seconds.
   ```
   
   MySQL would output "text".
   
   The problem boils down to the interpretation: is it a set of characters or a string to be trimmed?
   
   Perhaps address this in a new PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org