Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 13:27:48 UTC
[GitHub] [arrow-datafusion] alamb opened a new issue #156: Length kernel returns bytes not character length
alamb opened a new issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156
*Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11434
The Rust `length` kernel currently counts the number of bytes (octets) rather than the number of characters. Because Arrow strings are UTF-8 encoded, the two values differ for any non-ASCII text.
This means that the result of the `length` kernel on a string like `josé` is 5 (bytes) rather than 4 (characters), since `é` takes two bytes in UTF-8.
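To make the distinction concrete, here is a minimal sketch in plain Rust, using only the standard library rather than the Arrow kernel itself. The helper names `octet_length` and `character_length` are hypothetical and chosen for illustration. "Character" is taken here to mean a Unicode scalar value, and `é` is assumed to be the precomposed code point U+00E9.

```rust
/// Hypothetical helper: number of bytes (octets) in the UTF-8 encoding of `s`.
/// This is what a byte-counting `length` reports.
fn octet_length(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: number of Unicode scalar values (`char`s) in `s`.
/// Note this is not the same as the number of grapheme clusters.
fn character_length(s: &str) -> usize {
    s.chars().count()
}
```

With these definitions, `octet_length("jos\u{e9}")` evaluates to 5 and `character_length("jos\u{e9}")` evaluates to 4, while both return 4 for the ASCII string `"jose"`.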
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] novemberkilo commented on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-976340990
I've had a look around and see the `length` kernel being used in the implementation of `OctetLength` which appears to be tested specifically on the string `josé` here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L2554-L2561
I could use some help with understanding this issue please -- what is the desired outcome -- should the assertion in the test referenced above be 4 instead of 5?
Thanks.
[GitHub] [arrow-datafusion] novemberkilo commented on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-975311496
Hi - I would like to take on this issue please. // @alamb
[GitHub] [arrow-datafusion] alamb closed issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
alamb closed issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156
[GitHub] [arrow-datafusion] novemberkilo edited a comment on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
novemberkilo edited a comment on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-976340990
I've had a look around and see the `length` kernel being used in the implementation of `OctetLength` which appears to be tested specifically on the string `josé` here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L2554-L2561
I could use some help with understanding this issue please -- what is the desired outcome -- should the test referenced above change to say `Ok(Some(4))` perhaps?
Thanks.
[GitHub] [arrow-datafusion] novemberkilo edited a comment on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
novemberkilo edited a comment on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-975311496
Hi - I would like to take on this issue please. // @alamb @houqp
[GitHub] [arrow-datafusion] alamb commented on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-826835010
Comment from Mike Seddon (MikeSeddonAU) @ 2021-01-30T05:45:15.085+0000:
The function comments have been updated as part of https://github.com/apache/arrow/pull/9243 to clarify that `length` returns `bytes` not number of characters.
Comment from Mike Seddon (MikeSeddonAU) @ 2021-01-30T05:46:25.295+0000:
This has been addressed in https://github.com/apache/arrow/pull/9243 with updated comments to clarify that this function returns number of bytes not number of characters.
[GitHub] [arrow-datafusion] alamb commented on issue #156: Length kernel returns bytes not character length
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-977056132
Hi @novemberkilo -- I think you are correct that this ticket is ill-formed.
Specifically, since this ticket was filed, DataFusion appears to have gained both `CharacterLength` and `OctetLength`, perhaps due to the great work from @seddonm1:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L225
Closing this one as I don't think it is actionable anymore.