Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/27 07:41:47 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue #1688: Testing result of String related built-in functions

mingmwang opened a new issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688


   **Describe the bug**
   Recently I ran some tests on DataFusion's string-related built-in functions and found quite a few gaps compared with Spark SQL's (3.x) results. Some of them may be bugs in Spark SQL, but I believe most of them are bugs in DataFusion.
   
   **To Reproduce**
   Please see the details below:
   
   ----------------------------
   upper()
   
   SELECT upper('Haßler');
   
   Spark returns "HASSLER"
   DataFusion returns "HAßLER"
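   This looks like the difference between full Unicode case mapping, where 'ß' uppercases to "SS", and an ASCII-only mapping that leaves non-ASCII characters untouched. A minimal Rust illustration of the two behaviors (not DataFusion's actual code path):

   ```rust
   fn main() {
       let s = "Haßler";
       println!("{}", s.to_uppercase());       // "HASSLER": full Unicode mapping, ß -> SS
       println!("{}", s.to_ascii_uppercase()); // "HAßLER":  ASCII-only mapping leaves ß as-is
   }
   ```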
   ----------------------------
   array()
   
   SELECT array(1, 2, 3);
   
   Spark returns [1,2,3]
   DataFusion returns Error: NotImplemented("Array is not implemented for scalar values.")
   ----------------------------
   ascii()
   
   SELECT ascii('ß'); 
   
   Spark returns -61
   DataFusion returns 223
   
   SELECT ascii('ℝ');
   
   Spark returns -30
   DataFusion returns 8477 
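   The reported numbers line up with two different definitions of ascii(): Spark's values match the first UTF-8 byte of the string interpreted as a signed 8-bit integer, while DataFusion's match the Unicode code point of the first character. A quick Rust check:

   ```rust
   fn main() {
       for s in ["ß", "ℝ"] {
           let first_byte = s.as_bytes()[0] as i8;            // Spark-like: -61, -30
           let code_point = s.chars().next().unwrap() as u32; // DataFusion-like: 223, 8477
           println!("{}: first byte = {}, code point = {}", s, first_byte, code_point);
       }
   }
   ```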
   
   ----------------------------
   bit_length()
   
   SELECT bit_length('Spark SQL');
   SELECT bit_length('ß');
   SELECT bit_length('𠎠');
   SELECT bit_length('');
   
   DataFusion's results match Spark's.
   
   ----------------------------
   
   char() / chr()
   Spark supports both char() and chr(). DataFusion only supports chr().
   
   SELECT chr(65);
   Result matches
   
   SELECT chr(0);
   Spark returns empty result.
   DataFusion returns Error: Execution("null character not permitted.")
   
   SELECT chr(-1);
   Spark returns empty result.
   DataFusion returns Error: Execution("requested character too large for encoding."
   
   SELECT chr(65.0);
   Spark returns 'A'.
   DataFusion returns Error: Plan("Coercion from [Float64] to the signature Uniform(1, [Int64]) failed.")
   
   SELECT chr(10000000000000);
   Spark returns empty result.
   DataFusion returns Error: Execution("requested character too large for encoding.")
   
   ----------------------------
   char_length()
   
   SELECT char_length('Haßler');
   Result matches.
   
   SELECT char_length('é');
   Result matches.
   
   SELECT char_length('é');
   Spark returns 2.
   DataFusion returns 1
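   The two visually identical 'é' literals most likely differ in Unicode normalization: one is the precomposed U+00E9 (one code point), the other 'e' plus a combining acute accent U+0301 (two code points). An engine that counts code points reports 1 and 2 respectively; a count of 1 for the decomposed form would suggest grapheme-cluster counting or normalization somewhere in the pipeline. A minimal sketch of the distinction:

   ```rust
   fn main() {
       let precomposed = "\u{00E9}"; // 'é' as a single code point
       let decomposed = "e\u{0301}"; // 'e' followed by a combining acute accent
       println!("{}", precomposed.chars().count()); // 1
       println!("{}", decomposed.chars().count());  // 2
   }
   ```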
   
   SELECT char_length('𠎠');
   Result matches.
   
   SELECT char_length('𝕆');
   Result matches.
   
   SELECT char_length('');
   Result matches.
   
   ----------------------------
   concat()
   
   SELECT concat('Spark', 'SQL', '', '', '');
   Result matches.
   ----------------------------
   concat_ws()
   
   SELECT concat_ws('Spark', 'Spark', 'SQL','','','','');
   Result matches.
   
   SELECT concat_ws('s');
   
   Spark returns empty result.
   DataFusion returns Error: Internal("concat_ws was called with 1 arguments. It requires at least 2.")
   
   ----------------------------
   date_part()
   
   DataFusion only supports "hour" and "year" as the date part.
   
   SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
   Result matches
   
   SELECT date_part('YEAR', DATE'2019-08-12');
   Result matches
   
   SELECT date_part('YEAR', '2019-08-12');
   Spark returns 2019.
   DataFusion returns
   Error: Plan("Coercion from [Utf8, Utf8] to the signature OneOf([Exact([Utf8, Date32]), Exact([Utf8, Date64]), Exact([Utf8, Timestamp(Second, None)]), Exact([Utf8, Timestamp(Microsecond, None)]), Exact([Utf8, Timestamp(Millisecond, None)]), Exact([Utf8, Timestamp(Nanosecond, None)])]) failed.")
   
   SELECT date_part('hour', TIMESTAMP '2019-08-12 01:00:00.123456');
   Spark returns 1
   DataFusion returns 17
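   The 'hour' mismatch (1 vs 17) is another whole-hours offset, which points at time-zone handling when the literal is parsed rather than at the extraction itself: 01:00 in a UTC+8 session is 17:00 UTC on the previous day (UTC+8 is only an assumption here, inferred from the size of the shift). A small chrono 0.4 illustration of how attaching a zone before extraction changes the hour:

   ```rust
   use chrono::{FixedOffset, NaiveDateTime, TimeZone, Timelike, Utc};

   fn main() {
       let naive =
           NaiveDateTime::parse_from_str("2019-08-12 01:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
       // Taken as UTC, the hour is 1.
       println!("{}", Utc.from_utc_datetime(&naive).hour()); // 1
       // Taken as +08:00 local time and converted to UTC before extraction, it is 17.
       let local = FixedOffset::east(8 * 3600).from_local_datetime(&naive).unwrap();
       println!("{}", local.with_timezone(&Utc).hour()); // 17
   }
   ```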
   
   DataFusion doesn't support the following:
   
   SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
   SELECT date_part('doy', DATE'2019-08-12');
   SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
   SELECT date_part('days', interval 5 days 3 hours 7 minutes);
   SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
   SELECT date_part('MONTH', INTERVAL '2021-11' YEAR TO MONTH);
   SELECT date_part('MINUTE', INTERVAL '123 23:55:59.002001' DAY TO SECOND);
   
   ----------------------------
   
   date_trunc()
   
   DataFusion only supports lower-case granularities such as year/quarter/month/week/day/hour/minute/second.
   
   
   SELECT date_trunc('year', '2015-03-05T09:32:05.359');
   Spark returns '2015-01-01 00:00:00.0'.
   DataFusion returns Error: Plan("Coercion from [Utf8, Utf8] to the signature Exact([Utf8, Timestamp(Nanosecond, None)]) failed.").
   
   
   SELECT date_trunc('year', timestamp'2015-03-05T09:32:05.359')
   Spark returns '2015-01-01 00:00:00.0'.
   DataFusion returns '2015-01-01 00:00:00'
   
   DataFusion doesn't support the following:
   
   SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
   SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
   SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
   SELECT date_trunc('SECOND', timestamp'2015-03-05T09:32:05.123456');
   SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456');
   
   ----------------------------
   decode()
   
   DataFusion doesn't support this function
   
   SELECT decode(encode('abc', 'utf-8'), 'utf-8');
   SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
   SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
   SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
   
   ----------------------------
   
   encode()
   
   DataFusion doesn't support this function
   
   SELECT encode('abc', 'utf-8');
   
   ----------------------------
   format_string()
   
   DataFusion doesn't support this function
   
   
   SELECT format_string("Hello World %d %s", 100, "days");
   
   ----------------------------
   
   hex()
   
   The DataFusion function name is to_hex().
   
   
   SELECT hex(17);
   Result matches.
   
   SELECT hex('Spark SQL');
   Spark returns '537061726B2053514C'.
   DataFusion returns Error: Plan("Coercion from [Utf8] to the signature Uniform(1, [Int64]) failed.")
   
   ----------------------------
   
   instr()
   
   DataFusion doesn't support this function
   
   SELECT instr('SparkSQL', 'SQL');
   
   ----------------------------
   
   initcap()
   
   
   SELECT initcap('sPark sql');
   Result matches.
   
   SELECT initcap('ßsPark sql');
   Spark returns 'ßspark Sql'.
   DataFusion returns 'ßSpark Sq'.
   
   SELECT initcap('ß'), upper('ß');
   Spark returns 'ß   SS'.
   DataFusion returns 'ß   ß'.
   Looks like Spark's result is inconsistent.
   
   ----------------------------
   
   like
   
   SELECT 'Spark' like '_park';
   Result matches
   
   
   SELECT like('Spark', '_park');
   DataFusion doesn't support the like() function-call form.
   
   SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\\Users%';
   Spark returns true.
   DataFusion returns false.
   
   SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
   Spark returns true.
   DataFusion returns false.
   
   
   SELECT '%SystemDrive%/Users/John' like '/%SystemDrive/%//Users%' ESCAPE '/';
   Spark returns true.
   DataFusion returns Error: SQL(ParserError("Expected end of statement, found: '/'")).
   
   
   ----------------------------
   rlike, like any, like all
   
   DataFusion doesn't support those.
   
   ----------------------------
   left()
   
   
   SELECT left('Spark SQL', 3);
   Result matches.
   
   SELECT left('Spark SQL', 0);
   Result matches.
   
   SELECT left('Spark SQL', -20);
   Result matches.
   
   SELECT left('Spark SQL', 20);
   Result matches.
   
   SELECT left('Spark SQL', -1);
   Spark returns empty.
   DataFusion returns 'Spark SQ'. (DataFusion follows Postgres here, where a negative n means "all but the last |n| characters".)
   
   ----------------------------
   length()
   
   SELECT length('Haßler');
   Result matches.
   
   SELECT length('é');
   Result matches.
   
   SELECT length('?');
   SELECT length('测试测试');
   SELECT length('');
   Result matches.
   
   SELECT length('é');
   Spark returns 2
   DataFusion returns 1
   
   ----------------------------
   lpad()
   
   SELECT lpad('hi', 5, '??');
   SELECT lpad('hi', 1, '??');
   SELECT lpad('hi', 0, '??');
   Result matches.
   
   SELECT lpad('hi', -1, '??');
   Spark returns empty.
   DataFusion panics: thread 'main' panicked at 'capacity overflow', library/alloc/src/raw_vec.rs:509:5
   
   SELECT lpad('hi', 5);
   Result matches.
   
   SELECT lpad('hi', 100000, '??');
   Spark returns the padded string.
   DataFusion panics: thread 'main' panicked at 'attempt to add with overflow', /Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-5.0.0/src/utils/mod.rs:44:
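   Both panics look like integer handling rather than anything string-specific: a negative target length cast straight to usize becomes enormous (usize::MAX for -1), which is where a 'capacity overflow' panic comes from, and the second backtrace points into comfy-table, the CLI's result renderer, rather than the padding kernel itself. A rough sketch of the defensive handling (illustrative only, not DataFusion's code):

   ```rust
   // Clamping a negative length to zero gives Spark's "empty string" behavior
   // instead of a huge allocation and a panic.
   fn lpad(s: &str, len: i64, fill: &str) -> String {
       let len = usize::try_from(len).unwrap_or(0); // negative length -> empty result
       let s_len = s.chars().count();
       if s_len >= len {
           return s.chars().take(len).collect();
       }
       let mut out: String = fill.chars().cycle().take(len - s_len).collect();
       out.push_str(s);
       out
   }

   fn main() {
       println!("{:?}", lpad("hi", 5, "??"));  // "???hi"
       println!("{:?}", lpad("hi", -1, "??")); // ""
   }
   ```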
   
   ----------------------------
   
   locate()
   
   DataFusion doesn't support locate(). But DataFusion has another function called strpos().
   
   SELECT locate('bar', 'foobarbar');
   SELECT locate('bar', 'foobarbar', 5);
   
   ----------------------------
   ltrim(), btrim(),  rtrim()
   
   SELECT ltrim('    SparkSQL   ');
   SELECT length(ltrim('    SparkSQL   '));
   SELECT btrim('    SparkSQL   ');
   SELECT length(btrim('    SparkSQL   '));
   SELECT rtrim('    SparkSQL   ');
   SELECT length(rtrim('    SparkSQL   '));
   
   Result matches.
   ----------------------------
   md5()
   
   
   SELECT md5('Spark');
   SELECT md5('测试测试');
   
   Result matches.
   ----------------------------
   
   nullif()
   
   SELECT nullif(2, 2);
   Spark returns NULL
   DataFusion returns Error: NotImplemented("nullif does not support a literal as first argument").
   
   ----------------------------
   octet_length()
   
   SELECT octet_length('Spark SQL');
   Result matches.
   ----------------------------
   
   position()
   
   DataFusion doesn't support the position() function.
   
   SELECT position('bar', 'foobarbar');
   SELECT position('bar', 'foobarbar', 5);
   SELECT POSITION('bar' IN 'foobarbar');
   
   ----------------------------
   
   printf()
   
   DataFusion doesn't support the printf() function.
   
   SELECT printf("Hello World %d %s", 100, "days");
   
   ----------------------------
   
   repeat()
   
   SELECT repeat('123', 0);
   Result matches.
   
   SELECT repeat('123', -1);
   Spark returns empty.
   DataFusion panics: thread 'main' panicked at 'capacity overflow', library/alloc/src/slice.rs:558:50
   
   SELECT repeat('123', 100000);
   Spark returns the correct result.
   DataFusion panics: thread 'main' panicked at 'attempt to add with overflow', /Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-5.0.0/src/utils/mod.rs:44:9
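   The repeat() panics follow the same pattern as lpad() above: the negative-count case turns into a huge usize, and the large-count backtrace again points into comfy-table rather than the kernel. A one-line sketch of the clamping idea (illustrative only, not DataFusion's code):

   ```rust
   // Clamp a negative repeat count to zero rather than letting `-1 as usize`
   // become usize::MAX and overflow the allocation.
   fn repeat_str(s: &str, n: i64) -> String {
       s.repeat(usize::try_from(n).unwrap_or(0))
   }

   fn main() {
       println!("{:?}", repeat_str("123", -1)); // ""
       println!("{:?}", repeat_str("123", 3));  // "123123123"
   }
   ```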
   
   ----------------------------
   regexp()
   
   DataFusion doesn't support regexp(). (regexp_match() appears to be the corresponding function name in DataFusion?)
   
   SELECT regexp('%SystemDrive%\Users\John', '%SystemDrive%\\Users.*');
   
   ----------------------------
   regexp_extract()
   
   DataFusion doesn't support regexp_extract().
   
   SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
   
   ----------------------------
   regexp_replace()
   
   SELECT regexp_replace('100-200', '(\\d+)', 'num');
   Spark returns 'num-num'.
   DataFusion returns '100-200'. (Likely a string-literal escaping difference: Spark interprets '\\d' as \d, a digit class, while DataFusion/Postgres-style literals keep the double backslash, so the pattern looks for a literal backslash and never matches.)
   ----------------------------
   replace()
   
   SELECT replace('ABCabc', 'abc', 'DEF');
   SELECT replace('ABABA', 'ABA', 'abA');
   Result matches.
   
   ----------------------------
   reverse()
   
   SELECT reverse('Spark SQL');
   Result matches.
   ----------------------------
   substr()/substring()
   
   Spark supports both but DataFusion only supports substr().
   
   
   SELECT substr('Spark SQL', 5);
   Result matches.
   
   SELECT substr('Spark SQL', 5, 1);
   Result matches.
   
   
   SELECT substr('Spark SQL', -3);
   Spark returns 'SQL'.
   DataFusion returns 'Spark SQL'. (DataFusion follows Postgres, where a negative start just extends the window before position 1, so the whole string comes back; Spark counts a negative start from the end of the string.)
   
   DataFusion doesn't support the following:
   
   SELECT substr('Spark SQL' FROM 5);
   SELECT substr('Spark SQL' FROM -3);
   SELECT substr('Spark SQL' FROM 5 FOR 1);
   
   ----------------------------
   split()
   
   DataFusion doesn't support split(). DataFusion supports split_part() but the behavior is not the same.
   
   SELECT split('oneAtwoBthreeC', '[ABC]');
   SELECT split('oneAtwoBthreeC', '[ABC]', -1);
   SELECT split('oneAtwoBthreeC', '[ABC]', 2);
   
   ----------------------------
   trim()
   
   SELECT trim('    SparkSQL   ');
   Result matches.
   
   SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
   Result matches.
   SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
   Result matches.
   SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
   Result matches.
   
   
   DataFusion doesn't support the following:
   
   SELECT trim(BOTH FROM '    SparkSQL   ');
   SELECT trim(LEADING FROM '    SparkSQL   ');
   SELECT trim(TRAILING FROM '    SparkSQL   ');
   SELECT trim('SL' FROM 'SSparkSQLS');
   
   ----------------------------
   timestamp()
   
   DataFusion doesn't support the timestamp() function but supports the to_timestamp() function.
   
   SELECT timestamp(1230219000123);
   Spark returns empty
   
   SELECT to_timestamp(1230219000123);
   DataFusion returns '1970-01-01 00:20:30.219000123'.
   
   SELECT timestamp(1230219000);
   Spark returns '2008-12-25 08:30:00.0'
   
   SELECT to_timestamp(1230219000);
   DataFusion returns 1970-01-01 00:00:01.230219
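   The values suggest DataFusion's to_timestamp() treats a bare integer as nanoseconds since the epoch, while Spark's timestamp() treats it as seconds: 1230219000123 ns is only about 20 minutes 30.219 s past 1970-01-01, which is exactly what DataFusion prints. A quick arithmetic check:

   ```rust
   fn main() {
       let ns: i64 = 1_230_219_000_123;
       // 1230 s + 219000123 ns -> 1970-01-01 00:20:30.219000123
       println!("{} s + {} ns", ns / 1_000_000_000, ns % 1_000_000_000);
   }
   ```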
   
   ----------------------------
   timestamp_millis()
   
   DataFusion doesn't support the timestamp_millis() function but supports the to_timestamp_millis() function.
   
   SELECT timestamp_millis(1230219000123);
   Spark returns '2008-12-25 07:30:00.123'.
   
   SELECT to_timestamp_millis(1230219000123);
   DataFusion returns '2008-12-25 15:30:00.123'
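   Here the two engines differ by exactly eight hours, and the same offset shows up in the micros and seconds sections below. That is consistent with Spark rendering the timestamp in the session's local time zone while DataFusion prints UTC: 1230219000123 ms after the epoch is 2008-12-25 15:30:00.123 UTC. A quick check with chrono 0.4 (the library DataFusion uses):

   ```rust
   use chrono::{TimeZone, Utc};

   fn main() {
       let ts = Utc.timestamp(1_230_219_000, 123_000_000);
       println!("{}", ts); // 2008-12-25 15:30:00.123 UTC
   }
   ```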
   
   ----------------------------
   timestamp_micros()
   
   DataFusion doesn't support the timestamp_micros() function but supports the to_timestamp_micros() function.
   
   SELECT timestamp_micros(1230219000123123);
   Spark returns '2008-12-25 07:30:00.123123'
   
   SELECT to_timestamp_micros(1230219000123123);
   DataFusion returns '2008-12-25 15:30:00.123123'
   
   ----------------------------
   
   timestamp_seconds()
   
   DataFusion doesn't support the timestamp_seconds() function but supports the to_timestamp_seconds() function.
   
   SELECT timestamp_seconds(1230219000);
   Spark returns '2008-12-25 07:30:00'.
   
   
   SELECT to_timestamp_seconds(1230219000);
   DataFusion returns '2008-12-25 15:30:00'.
   
   
   SELECT timestamp_seconds(1230219000.123);
   Spark returns '2008-12-25 07:30:00.123'.
   
   SELECT to_timestamp_seconds(1230219000.123);
   DataFusion returns Error: ArrowError(CastError("Error parsing '1230219000.123' as timestamp")).
   
   SELECT to_timestamp_seconds(-100000000000000);
   
   thread 'main' panicked at 'invalid or out-of-range datetime', /Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/datetime.rs:117:18
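   This panic is raised inside chrono rather than in the SQL layer: NaiveDateTime::from_timestamp() panics when the seconds value is outside chrono's supported range (roughly plus or minus 262,000 years), and -100000000000000 s is far outside it. The non-panicking variant returns an Option that could be surfaced as a normal error instead (a sketch of the idea, not DataFusion's code):

   ```rust
   use chrono::NaiveDateTime;

   fn main() {
       match NaiveDateTime::from_timestamp_opt(-100_000_000_000_000, 0) {
           Some(dt) => println!("{}", dt),
           None => println!("timestamp out of range"), // what the engine could report
       }
   }
   ```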
   
   ----------------------------
   translate()
   
   SELECT translate('AaBbCc', 'abc', '123');
   Result matches.
   
   ----------------------------
   unbase64()
   
   DataFusion doesn't support this function.
   
   SELECT unbase64('U3BhcmsgU1FM');
   
   ----------------------------
   unhex()
   
   DataFusion doesn't support this function.
   
   SELECT unhex('537061726B2053514C');
   
   ----------------------------
   
   
   **Expected behavior**
   A clear and concise description of what you expected to happen.
   
   **Additional context**
   





[GitHub] [arrow-datafusion] seddonm1 commented on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1024747600


   Remember these were tested against Postgres, not Spark, as Postgres is much more closely aligned with ANSI SQL than Spark.





[GitHub] [arrow-datafusion] mingmwang commented on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1023835187


   > Thank you @mingmwang for this detailed comparison note! I think it would be good to create a separate issue for each of the problems identified in this research so it's easier for the community to divide and conquer. I can see some of them will make good first issues that new community members could pick up to ramp up on the code base.
   
   Sure, I will create a separate issue for each of them. I will also try to start fixing some of the gaps.





[GitHub] [arrow-datafusion] houqp commented on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1023519901


   Thank you @mingmwang for this detailed comparison note! I think it would be good to create a separate issue for each of the problems identified in this research so it's easier for the community to divide and conquer. I can see some of them will make good first issues that new community members could pick up to ramp up on the code base.





[GitHub] [arrow-datafusion] seddonm1 edited a comment on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1024747600


   Remember these were tested against Postgres, not Spark, as Postgres is much more closely aligned with ANSI SQL than Spark.
   
   I implemented most of these and tested explicitly against the agreed target Postgres values. Spark is not exactly the gold-standard of following ANSI spec.





[GitHub] [arrow-datafusion] xudong963 commented on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1022950128


   About `substr` & `substring`
   
   I believe DataFusion now supports `substring` and `select substring("" from x for y)`. BTW, the behavior is the same as Postgres.
   ```sql
   ❯ SELECT substring('Spark SQL', -3);
   +-------------------------------------+
   | substr(Utf8("Spark SQL"),Int64(-3)) |
   +-------------------------------------+
   | Spark SQL                           |
   +-------------------------------------+
   1 row in set. Query took 0.006 seconds.
   ❯ SELECT substr('Spark SQL', -3);
   +-------------------------------------+
   | substr(Utf8("Spark SQL"),Int64(-3)) |
   +-------------------------------------+
   | Spark SQL                           |
   +-------------------------------------+
   1 row in set. Query took 0.006 seconds.
   ❯ SELECT substring('Spark SQL' from 5 for 1);
   +---------------------------------------------+
   | substr(Utf8("Spark SQL"),Int64(5),Int64(1)) |
   +---------------------------------------------+
   | k                                           |
   +---------------------------------------------+
   1 row in set. Query took 0.007 seconds.
   ```
   





[GitHub] [arrow-datafusion] seddonm1 edited a comment on issue #1688: Testing result of String related built-in functions

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688#issuecomment-1024747600


   Remember these were built to align with Postgres, not Spark, as Postgres is much more closely aligned with ANSI SQL than Spark.
   
   I implemented most of these and tested explicitly against the agreed target Postgres values. Spark is not exactly the gold-standard of following ANSI spec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org