Posted to dev@commons.apache.org by Duncan Jones <du...@wortharead.com> on 2017/05/21 12:06:21 UTC

[TEXT] How do we want to handle case conversions?

Hi everyone,

I’ve found some time to continue breaking WordUtils into separate classes (eschewing the “big collection of static methods” approach). However, as I read more about case handling in Unicode, I realise how simplistic the WordUtils methods are and how complex a full solution would need to be.

Section 5.18 of the Unicode specification [1] describes these complexities. The main ones that bother me are:

1. Title case conversions vary widely between different locales and languages. I’m not clear whether any locale is satisfied by the current simplistic implementation in WordUtils.capitalize(str). Supporting this correctly would be a serious challenge.

2. All types of case conversion may vary depending upon context/locale. There are examples provided in [1] where the outcome is different in a Turkish locale, or depends on whether the letter in question is followed by another.
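
To make the second point concrete, here is a minimal sketch using only the JDK's String.toLowerCase(Locale) and String.toUpperCase(Locale). The class and method names (LocaleCaseDemo, lower, upper) are made up for illustration and are not part of WordUtils or [text]. It shows one locale-dependent case (Turkish dotted and dotless i) and one context-dependent case (Greek capital sigma at the end of a word).

```java
import java.util.Locale;

// Hypothetical demo class, not part of commons-text.
final class LocaleCaseDemo {

    // Lower-cases using the rules of the given locale.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Upper-cases using the rules of the given locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-dependent: Turkish pairs I with dotless i (U+0131)
        // and dotted capital I (U+0130) with i.
        System.out.println(lower("TITLE", Locale.ROOT));
        System.out.println(lower("TITLE", turkish));
        System.out.println(upper("title", turkish));

        // Context-dependent: capital sigma (U+03A3) lower-cases to the
        // final form (U+03C2) at the end of a word.
        String odos = "\u039F\u0394\u039F\u03A3";
        System.out.println(lower(odos, Locale.ROOT));
    }
}
```

The point of the sketch is only that the same input string produces different results depending on the Locale argument and on the position of a letter, which a per-character approach cannot capture.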

Does anyone have a suggestion for how to move forward with this work? I see three options: 1] Admit defeat and avoid the case conversion mess entirely. 2] Mimic the existing functionality, but document the limitations. 3] Attempt to deliver a locale-dependent version, perhaps still with limitations (or for certain languages).

I’m leaning towards 2, perhaps even calling the classes “SimpleX…”.

Thanks,
Duncan

[1] http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [TEXT] How do we want to handle case conversions?

Posted by Rob Tompkins <ch...@gmail.com>.

> On May 22, 2017, at 6:04 AM, sebb <se...@gmail.com> wrote:
> 
> On 22 May 2017 at 06:56, Duncan Jones <du...@wortharead.com> wrote:
>> 
>>> On 21 May 2017, at 19:43, Gary Gregory <ga...@gmail.com> wrote:
>>> 
>>> Pardon the obvious but what is missing from methods like
>>> https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
>>> 
>>> Gary
>> 
>> 
>> The WordUtils methods turn sentences into title case, which Java’s core libraries don’t offer. In fact, the core libraries make doing locale-sensitive title case conversions very difficult (see http://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java for example).
>> 
>> Doing title casing correctly is quite a subtle art. We don’t even do it correctly for English at the moment, which would normally capitalise “The Life of Reilly” rather than “The Life Of Reilly”. Other languages have completely different conventions or additional complexities.
>> 
> 
> However the Javadoc does state that the capitalisation is based on
> words, not sentences.
> So I don't know if there is any expectation that it will take account
> of the meaning of the words.
> 
> I guess the question is whether that is useful at all?
> If so, we should clarify that the processing takes no account of the
> meaning of the words.
> If not, we should perhaps drop the methods.
> 
> I think it will be a huge effort to produce anything that works
> properly even for US English, let alone UK English.
> 

I agree about the level of effort needed to capitalize anything properly in a semantic fashion without resorting to some approximation. The only clearly deterministic way to capitalize is to rely on delimiters.
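
As a rough sketch of what "rely on delimiters" means in practice (DelimiterCapitalizer and its capitalize method are hypothetical names, and this is not the WordUtils implementation): the first character after the start of the text or after any delimiter is title-cased, and nothing else is touched.

```java
// Hypothetical sketch of purely delimiter-driven capitalization.
final class DelimiterCapitalizer {

    // Title-cases the first char after the start of the text or after any
    // delimiter character; every other character is copied unchanged.
    // Works on UTF-16 chars, so supplementary characters are not handled.
    static String capitalize(String text, String delimiters) {
        StringBuilder out = new StringBuilder(text.length());
        boolean atWordStart = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (delimiters.indexOf(c) >= 0) {
                out.append(c);
                atWordStart = true;
            } else if (atWordStart) {
                out.append(Character.toTitleCase(c));
                atWordStart = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("the life of reilly", " ")); // The Life Of Reilly
        System.out.println(capitalize("ee cummings", " "));        // Ee Cummings
        System.out.println(capitalize("o'toole", " "));            // O'toole
        System.out.println(capitalize("o'toole", " '"));           // O'Toole
        System.out.println(capitalize("don't", " '"));             // Don'T
    }
}
```

The output is fully predictable from the delimiter set, but no single delimiter set gets both "O'Toole" and "don't" right, which is the limitation being discussed.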

Admitting defeat (for Commons) isn’t an unreasonable option, with the possibility of putting the bulk of the work in OpenNLP. That seems a better venue for such an algorithm, because the mechanics of language determination are present there and not here.

> Names will be a particular problem: ee cummings, D'Ath, O'Toole, MacDonald

Re: [TEXT] How do we want to handle case conversions?

Posted by sebb <se...@gmail.com>.

On 22 May 2017 at 06:56, Duncan Jones <du...@wortharead.com> wrote:
>
>> On 21 May 2017, at 19:43, Gary Gregory <ga...@gmail.com> wrote:
>>
>> Pardon the obvious but what is missing from methods like
>> https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
>>
>> Gary
>
>
> The WordUtils methods turn sentences into title case, which Java’s core libraries don’t offer. In fact, the core libraries make doing locale-sensitive title case conversions very difficult (see http://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java for example).
>
> Doing title casing correctly is quite a subtle art. We don’t even do it correctly for English at the moment, which would normally capitalise “The Life of Reilly” rather than “The Life Of Reilly”. Other languages have completely different conventions or additional complexities.
>

However the Javadoc does state that the capitalisation is based on
words, not sentences.
So I don't know if there is any expectation that it will take account
of the meaning of the words.

I guess the question is whether that is useful at all?
If so, we should clarify that the processing takes no account of the
meaning of the words.
If not, we should perhaps drop the methods.

I think it will be a huge effort to produce anything that works
properly even for US English, let alone UK English.

Names will be a particular problem: ee cummings, D'Ath, O'Toole, MacDonald

Re: [TEXT] How do we want to handle case conversions?

Posted by Gary Gregory <ga...@gmail.com>.

On May 21, 2017 10:56 PM, "Duncan Jones" <du...@wortharead.com> wrote:

>> On 21 May 2017, at 19:43, Gary Gregory <ga...@gmail.com> wrote:
>>
>> Pardon the obvious but what is missing from methods like
>> https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
>>
>> Gary
>
> The WordUtils methods turn sentences into title case, which Java’s core
> libraries don’t offer. In fact, the core libraries make doing
> locale-sensitive title case conversions very difficult (see
> http://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java
> for example).
>
> Doing title casing correctly is quite a subtle art. We don’t even do it
> correctly for English at the moment, which would normally capitalise “The
> Life of Reilly” rather than “The Life Of Reilly”. Other languages have
> completely different conventions or additional complexities.

I see. So the hard part is coming up with the rules.

Aside from that I could see creating an instance of a class
"TitleCaseConverter" or some such with a Locale through a factory method.
The factory can decide whether or not to create a Locale-specific subclass.
Maybe there are general rules that could be implemented in the parent class
or even driven off a locale-specific properties file... TBD ;-)
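
A very rough sketch of that shape, to show the idea rather than a proposed API. Assumptions: the forLocale factory, the convert method and the "minorWords" property key are all invented here; the rules are inlined as properties-syntax strings instead of being read from per-locale files; and the variation comes from data rather than from a subclass. Real English title casing has more rules than a minor-word list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch; no such class exists in commons-text.
final class TitleCaseConverter {

    private final Locale locale;
    private final Set<String> minorWords;

    private TitleCaseConverter(Locale locale, Set<String> minorWords) {
        this.locale = locale;
        this.minorWords = minorWords;
    }

    // Factory method: chooses the rule set from the locale's language.
    // The rules are inlined here; a fuller version might load one
    // properties file per locale instead.
    static TitleCaseConverter forLocale(Locale locale) {
        String rules = "en".equals(locale.getLanguage())
                ? "minorWords=a,an,and,of,the"
                : "minorWords=";
        Properties props = new Properties();
        try {
            props.load(new StringReader(rules));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        Set<String> minor = new HashSet<>();
        for (String word : props.getProperty("minorWords", "").split(",")) {
            if (!word.isEmpty()) {
                minor.add(word);
            }
        }
        return new TitleCaseConverter(locale, minor);
    }

    // Capitalises each space-separated word, except minor words that are
    // not the first word. Case mapping uses the converter's locale.
    String convert(String text) {
        String[] words = text.split(" ");
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (i > 0) {
                out.append(' ');
            }
            if (word.isEmpty()) {
                continue;
            }
            if (i > 0 && minorWords.contains(word.toLowerCase(locale))) {
                out.append(word.toLowerCase(locale));
            } else {
                // End index of the first code point, so that a surrogate
                // pair is never split.
                int end = word.offsetByCodePoints(0, 1);
                out.append(word.substring(0, end).toUpperCase(locale));
                out.append(word.substring(end));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(forLocale(Locale.ENGLISH).convert("the life of reilly"));
        System.out.println(forLocale(Locale.forLanguageTag("tr")).convert("istanbul"));
    }
}
```

With the English rules this gives "The Life of Reilly", and with a Turkish locale "istanbul" starts with a dotted capital I (U+0130). Names and most other languages would still need their own rules.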

Gary

Re: [TEXT] How do we want to handle case conversions?

Posted by Duncan Jones <du...@wortharead.com>.

> On 21 May 2017, at 19:43, Gary Gregory <ga...@gmail.com> wrote:
> 
> Pardon the obvious but what is missing from methods like
> https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
> 
> Gary


The WordUtils methods turn sentences into title case, which Java’s core libraries don’t offer. In fact, the core libraries make doing locale-sensitive title case conversions very difficult (see http://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java for example).

Doing title casing correctly is quite a subtle art. We don’t even do it correctly for English at the moment, which would normally capitalise “The Life of Reilly” rather than “The Life Of Reilly”. Other languages have completely different conventions or additional complexities.


Re: [TEXT] How do we want to handle case conversions?

Posted by Gary Gregory <ga...@gmail.com>.

Pardon the obvious but what is missing from methods like
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)

Gary

Re: [TEXT] How do we want to handle case conversions?

Posted by Duncan Jones <du...@wortharead.com>.

> On 21 May 2017, at 18:41, Claude Warren <cl...@xenei.com> wrote:
> 
> Seems like you have done a lot of investigation so let me ask.  Can you
> develop a mechanism that is extensible to support the various locales and
> then just implement the simple version?  This would provide a framework so
> others could implement the ones they desire as needs arise.

Nice idea, I’ll see what’s possible.


Re: [TEXT] How do we want to handle case conversions?

Posted by Claude Warren <cl...@xenei.com>.

Seems like you have done a lot of investigation so let me ask.  Can you
develop a mechanism that is extensible to support the various locales and
then just implement the simple version?  This would provide a framework so
others could implement the ones they desire as needs arise.

Re: [TEXT] How do we want to handle case conversions?

Posted by Benedikt Ritter <br...@apache.org>.

Hi,

> Am 21.05.2017 um 08:06 schrieb Duncan Jones <du...@wortharead.com>:
> 
> Hi everyone,
> 
> I’ve found some time to continue breaking WordUtils into separate classes (eschewing the “big collection of static methods” approach). However, as I read more about case handling in Unicode, I realise how simplistic the WordUtils methods are and how complex a full solution would need to be.
> 
> Section 5.18 of the Unicode specification [1] describes these complexities. The main ones that bother me are:
> 
> 1. Title case conversions vary widely between different locales and languages. I’m not clear whether any locale is satisfied by the current simplistic implementation in WordUtils.capitalize(str). Supporting this correctly would be a serious challenge.
> 
> 2. All types of case conversion may vary depending upon context/locale. There are examples provided in [1] where the outcome is different in a Turkish locale, or depends on whether the letter in question is followed by another.
> 
> Does anyone have a suggestion for how to move forward with this work? I see three options: 1] Admit defeat and avoid the case conversion mess entirely. 2] Mimic the existing functionality, but document the limitations. 3] Attempt to deliver a locale-dependent version, perhaps still with limitations (or for certain languages).
> 
> I’m leaning towards 2, perhaps even calling the classes “SimpleX…”.

Sounds good to me.

Re: [TEXT] How do we want to handle case conversions?

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.

Fair enough. My use case was not really important, and if I need to use the simple approach it is still quite easy and quick to implement it in my code.

In case someone else ever finds another reason for having it, we can always revisit it later.

>Perhaps after that, we could deprecate WordUtils for removal in 2.0?


Sounds like a good plan.

Cheers
Bruno

Re: [TEXT] How do we want to handle case conversions?

Posted by Duncan Jones <du...@wortharead.com>.

> On 22 May 2017, at 12:15, Bruno P. Kinoshita <br...@yahoo.com.br.INVALID> wrote:
> 
> I'd be in favour of 2 or some variation of it. Provide a well documented naïve implementation, and use whatever is available at the JVM for handling upper/lower case.
> I would use it for very simple cases, where all I need is to capitalise each word, and where it would be OK to have possible mistakes in case you have to handle text that is not in English, or special cases like some new mathematical symbol (e.g. U+1D52B mathematical fraktur small N, which also uses surrogate to make it even more interesting).
> 
> For cases where I have to take care of different languages (e.g. ch digraph for Czech) I would probably use ICU.
> 
> 
> For cases that depend on the country, context, or some other feature (e.g. names in Dutch with the van preposition) I would probably look at OpenNLP with a machine learning or rule based approach.
> 
> The issue is that when all I need is the very simple approach now, I would have to write something like a for-loop or Java 8 stream and split the text, then call toUpperCase on each first char, then write tests for it, etc. I think for this case it would still be worth having our simple implementation in [text], with docs explaining what it is capable of, and what it is not.

I’m beginning to think a simple implementation is harmful - developers will use it in places where they ought to be doing something more locale-specific. Even in English, trivially capitalising each word is rarely correct. We should avoid the topic entirely if we don't do it justice and I agree that OpenNLP seems a good home for this sort of work.

Consequently, I see little benefit in rewriting the WordUtils capitalisation methods and plan to leave them alone. I’m also tempted to extend their Javadocs to point out the deficiencies.

I’ll instead focus on pulling out the wrapping methods into something more object-oriented. Perhaps after that, we could deprecate WordUtils for removal in 2.0?

Re: [TEXT] How do we want to handle case conversions?

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.

I'd be in favour of 2 or some variation of it. Provide a well-documented naïve implementation, and use whatever is available in the JVM for handling upper/lower case.
I would use it for very simple cases, where all I need is to capitalise each word, and where it would be OK to have possible mistakes when handling text that is not in English, or special cases like some new mathematical symbol (e.g. U+1D52B mathematical fraktur small N, which also needs a surrogate pair, to make it even more interesting).

For cases where I have to take care of different languages (e.g. ch digraph for Czech) I would probably use ICU.


For cases that depend on the country, context, or some other feature (e.g. names in Dutch with the van preposition) I would probably look at OpenNLP with a machine learning or rule based approach.

The issue is that when all I need is the very simple approach now, I would have to write something like a for-loop or Java 8 stream and split the text, then call toUpperCase on each first char, then write tests for it, etc. I think for this case it would still be worth having our simple implementation in [text], with docs explaining what it is capable of, and what it is not.
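
For reference, the kind of hand-written code I mean is roughly the following sketch (SimpleCapitalizer is a made-up name, and this is only an illustration, not a proposal for the [text] API). It works on code points rather than chars so that surrogate pairs are not split, and it uses Character.toTitleCase(int), which takes no Locale and so is not locale-sensitive.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical sketch of the "very simple approach".
final class SimpleCapitalizer {

    // Title-cases the first code point of a word and leaves the rest as is.
    static String capitalizeWord(String word) {
        if (word.isEmpty()) {
            return word;
        }
        int first = word.codePointAt(0);
        int restStart = Character.charCount(first);
        return new StringBuilder(word.length())
                .appendCodePoint(Character.toTitleCase(first))
                .append(word, restStart, word.length())
                .toString();
    }

    // Splits on single spaces, capitalises each word, and joins them back.
    // The -1 limit keeps trailing empty strings so spacing is preserved.
    static String capitalize(String text) {
        return Arrays.stream(text.split(" ", -1))
                .map(SimpleCapitalizer::capitalizeWord)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(capitalize("hello world"));
        // U+1D52B has no case mapping, so it is left unchanged.
        System.out.println(capitalize("\uD835\uDD2B").equals("\uD835\uDD2B"));
        // U+10428 (Deseret small letter long i) maps to U+10400.
        System.out.println(capitalize("\uD801\uDC28").equals("\uD801\uDC00"));
    }
}
```

Even this small version needs care around surrogates and tests of its own, which is the effort I would like to avoid repeating in every project.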

Cheers
Bruno
[] https://codepoints.net/U+1D52B?lang=en
[] https://en.wikipedia.org/wiki/Ch_(digraph)#Czech
[] https://en.wikipedia.org/wiki/Van_(Dutch)#Collation_and_capitalisation