Posted to dev@arrow.apache.org by Sagnik Chakraborty <sa...@dremio.com> on 2020/12/22 05:28:55 UTC

upper() / lower() for utf8 strings

We are looking to implement upper() / lower() for non-ASCII characters. The current Gandiva implementation handles upper() / lower() only for standard ASCII characters.
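
To illustrate the limitation described above, here is a minimal sketch of an ASCII-only upper(). The function name AsciiUpper is hypothetical and this is not Gandiva's actual code; it only shows why a byte-wise a-z check leaves multi-byte UTF-8 characters untouched.

```cpp
#include <string>

// Hypothetical helper (not Gandiva's actual code): uppercases only the
// ASCII letters a-z and copies every other byte through unchanged.
std::string AsciiUpper(const std::string& in) {
  std::string out = in;
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') {
      c = static_cast<char>(c - 'a' + 'A');
    }
  }
  return out;
}
```

For example, "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9. Neither byte falls in a-z, so AsciiUpper("caf\xC3\xA9") converts "caf" but leaves "é" in lowercase.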

For the Gandiva implementation, I went through a few articles and StackOverflow answers. The top answer to this question <https://stackoverflow.com/questions/36897781/how-to-uppercase-lowercase-utf-8-characters-in-c> suggests that there is no standard way to do Unicode case conversion in C/C++, and that an external library such as ICU <https://unicode-org.github.io/icu-docs/#/icu4c/> is needed for reliable Unicode case conversion.

So I wanted to ask: when adding an external library to Gandiva, what issues do we need to watch for so that we neither break existing code nor sacrifice performance? Is there an existing library that we can use to solve this problem? Any suggestions would be welcome.

Regards,
Sagnik

Re: upper() / lower() for utf8 strings

Posted by Maarten Breddels <ma...@gmail.com>.

Hi Sagnik,

It might be worth taking a look at https://github.com/apache/arrow/pull/7449
(that kernel code of mine is a bit cumbersome).
TL;DR version:
unilib is faster than utf8proc, but there are licensing issues with unilib.
We instead use a LUT (lookup table) to speed things up, at the cost of some
memory (it would be great if we at least shared the LUTs).
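
The LUT idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: all names (kLutSize, BuildUpperLut, DecodeUtf8, AppendUtf8, Utf8UpperLut) are hypothetical, the table is hand-filled and covers only U+0000..U+00FF, the input is assumed to be valid UTF-8, and code points beyond the table pass through unchanged where a real kernel would fall back to a library such as utf8proc.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names throughout; this is not the code from the Arrow PR.
constexpr uint32_t kLutSize = 0x100;  // covers U+0000..U+00FF only

// Build the table once: index = code point, value = its uppercase mapping.
std::array<uint32_t, kLutSize> BuildUpperLut() {
  std::array<uint32_t, kLutSize> lut{};
  for (uint32_t cp = 0; cp < kLutSize; ++cp) lut[cp] = cp;  // identity default
  for (uint32_t cp = 'a'; cp <= 'z'; ++cp) lut[cp] = cp - 0x20;
  // Latin-1 letters U+00E0..U+00FE sit 0x20 above their uppercase forms,
  // except U+00F7 (division sign), which has no case.
  for (uint32_t cp = 0xE0; cp <= 0xFE; ++cp) {
    if (cp != 0xF7) lut[cp] = cp - 0x20;
  }
  lut[0xB5] = 0x39C;  // micro sign -> Greek capital mu
  lut[0xFF] = 0x178;  // y with diaeresis -> capital form outside Latin-1
  return lut;
}

// Decodes one code point starting at s[i] and advances i.
// Assumes the input is valid UTF-8; no error handling is attempted.
uint32_t DecodeUtf8(const std::string& s, size_t& i) {
  const auto b0 = static_cast<unsigned char>(s[i++]);
  if (b0 < 0x80) return b0;
  int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
  uint32_t cp = b0 & (0x3F >> extra);  // keep the payload bits of the lead byte
  while (extra-- > 0) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Appends the UTF-8 encoding of cp to out.
void AppendUtf8(uint32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

std::string Utf8UpperLut(const std::string& in) {
  static const std::array<uint32_t, kLutSize> lut = BuildUpperLut();
  std::string out;
  out.reserve(in.size());
  size_t i = 0;
  while (i < in.size()) {
    const uint32_t cp = DecodeUtf8(in, i);
    // Fast path: one table lookup. Code points beyond the table pass
    // through unchanged in this sketch.
    AppendUtf8(cp < kLutSize ? lut[cp] : cp, out);
  }
  return out;
}
```

The trade-off is the one mentioned above: a table covering more of Unicode makes more lookups hit the fast path but costs memory, which is why sharing one table between callers would be attractive.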

cpp/src/arrow/util/utf8.h
and cpp/src/arrow/compute/kernels/scalar_string.cc might be useful to take
a look at.
At https://issues.apache.org/jira/browse/ARROW-555 there was a bit of
discussion on using the same codebase for the arrow kernel and Gandiva, but
that never got off the ground.
So yes, if you can do what Wes suggests, that would be great.

cheers,

Maarten Breddels

On Wed, Dec 23, 2020 at 4:48 PM Wes McKinney <we...@gmail.com> wrote:

> It might be worthwhile to see if some reusable templates can be
> assembled that can be employed in both places

Re: upper() / lower() for utf8 strings

Posted by Wes McKinney <we...@gmail.com>.

It might be worthwhile to see if some reusable templates can be
assembled that can be employed in both places

On Tue, Dec 22, 2020 at 5:47 PM Neal Richardson
<ne...@gmail.com> wrote:
>
> FWIW the C++ compute library now uses
> https://github.com/JuliaStrings/utf8proc, so assuming it does all of the
> things you want, it could save you some trouble if you used it in Gandiva
> too--cmake is already set up to use it.
>
> Neal

Re: upper() / lower() for utf8 strings

Posted by Neal Richardson <ne...@gmail.com>.

FWIW the C++ compute library now uses
https://github.com/JuliaStrings/utf8proc, so assuming it does all of the
things you want, it could save you some trouble if you used it in Gandiva
too--cmake is already set up to use it.

Neal
