Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 13:27:48 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #156: Length kernel returns bytes not character length

alamb opened a new issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11434
   
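   The Rust `length` kernel currently counts the number of bytes (octets) rather than the number of characters. The two differ because Arrow stores strings as UTF-8, where a single character can occupy more than one byte.
   
   This means that the result of the `length` kernel on a string like `josé` will be 5 (bytes) rather than 4 (characters), since `é` is encoded as two bytes.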
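
For illustration, the byte/character difference described above can be reproduced with the Rust standard library alone. This is a minimal sketch, not the Arrow kernel: `byte_length` and `char_length` are hypothetical helper names, and the example assumes `é` is the single precomposed code point U+00E9.

```rust
/// Hypothetical helper: number of bytes (octets) in the UTF-8 encoding of `s`.
/// `str::len` reports bytes, which is the behaviour the issue describes.
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: number of Unicode scalar values (`char`s) in `s`.
fn char_length(s: &str) -> usize {
    s.chars().count()
}
```

Calling `byte_length("josé")` yields 5 and `char_length("josé")` yields 4 under the assumption above. Note that `chars()` counts Unicode scalar values, not user-perceived grapheme clusters, so a decomposed `e` followed by a combining accent would count as two.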


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] novemberkilo commented on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-976340990


   I've had a look around and see the `length` kernel being used in the implementation of `OctetLength` which appears to be tested specifically on the string `josé` here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L2554-L2561
   
   I could use some help with understanding this issue please -- what is the desired outcome -- should the assertion in the test referenced above be 4 instead of 5?
   
   Thanks. 





[GitHub] [arrow-datafusion] novemberkilo commented on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-975311496


   Hi - I would like to take on this issue please. // @alamb 





[GitHub] [arrow-datafusion] alamb closed issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
alamb closed issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156


   





[GitHub] [arrow-datafusion] novemberkilo edited a comment on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
novemberkilo edited a comment on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-976340990


   I've had a look around and see the `length` kernel being used in the implementation of `OctetLength` which appears to be tested specifically on the string `josé` here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L2554-L2561
   
   I could use some help with understanding this issue please -- what is the desired outcome -- should the test referenced above change to say `Ok(Some(4))` perhaps?
   
   Thanks. 





[GitHub] [arrow-datafusion] novemberkilo edited a comment on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
novemberkilo edited a comment on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-975311496


   Hi - I would like to take on this issue please. // @alamb @houqp 





[GitHub] [arrow-datafusion] alamb commented on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-826835010


   Comment from Mike Seddon (MikeSeddonAU) @ 2021-01-30T05:45:15.085+0000:
   The function comments have been updated as part of https://github.com/apache/arrow/pull/9243 to clarify that `length` returns the number of bytes, not the number of characters.
   
   Comment from Mike Seddon (MikeSeddonAU) @ 2021-01-30T05:46:25.295+0000:
   This has been addressed in https://github.com/apache/arrow/pull/9243 with updated comments to clarify that this function returns the number of bytes, not the number of characters.





[GitHub] [arrow-datafusion] alamb commented on issue #156: Length kernel returns bytes not character length

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #156:
URL: https://github.com/apache/arrow-datafusion/issues/156#issuecomment-977056132


   Hi @novemberkilo -- I think you are correct that this ticket is ill-formed.
   
   Specifically, since this ticket was filed, DataFusion appears to have both `CharacterLength` as well as `OctetLength`, perhaps due to the great work from @seddonm1 
   
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L225
   
   Closing this one as I don't think it is actionable anymore
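
To make the distinction between the two functions concrete, here is a rough sketch of how byte-based and character-based lengths differ over a nullable column of strings. It is not DataFusion's implementation: the function names, the `&[Option<&str>]` stand-in for a string array, and the `i32` result type are assumptions chosen for illustration.

```rust
/// Hypothetical stand-in for an `OctetLength`-style function:
/// the number of UTF-8 bytes per value, with nulls passed through.
fn octet_length(values: &[Option<&str>]) -> Vec<Option<i32>> {
    values
        .iter()
        .map(|&v| v.map(|s| s.len() as i32))
        .collect()
}

/// Hypothetical stand-in for a `CharacterLength`-style function:
/// the number of Unicode scalar values per value, with nulls passed through.
fn character_length(values: &[Option<&str>]) -> Vec<Option<i32>> {
    values
        .iter()
        .map(|&v| v.map(|s| s.chars().count() as i32))
        .collect()
}
```

With both functions available, a test asserting 5 for the octet length of `josé` remains correct, and the value 4 belongs to the character-length function instead.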

