You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Alex Herbert <al...@gmail.com> on 2019/03/02 16:47:24 UTC

[Text, Lang] Matching two CharSequence instances

I am helping with the PR for TEXT-126 to add to the similarity package.

Part of the new algorithm requires identifying if two CharSequences are identical. Is there a utility in Text to do something like this:

public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);

I cannot find one with a quick regex search of the library. I am not familiar with Lang either but this is a dependency so a method from there could be used.

The current PR is using left.equals(right) on the input CharSequence to compare to one to another which is wrong if the two input CharSequences do not support matching, e.g. if the input was a String and StringBuilder then String.equals(StringBuilder) would not match, even if the characters were the same.

Regards,

Alex


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [Text, Lang] Matching two CharSequence instances

Posted by Alex Herbert <al...@gmail.com>.


> On 3 Mar 2019, at 02:00, Bruno P. Kinoshita <ki...@apache.org> wrote:
> 
> Hi Alex,
> Also found the two implementations similar, but confusing.
> - The parameter name inconsistency
> +1
> 
> 
> - The edge case logic inconsistency
> 
> +1
> 
> - Change to a stepwise charAt comparison
> +1 if the behaviour is kept. Right now some crazy characters like those old latin letters are supported (e.g. LATIN CAPITAL LIGATURE IJ), and the result for equal and equal ignore case are as expected.
> And right now those methods do not work with surrogate characters (e.g. GEORGIAN LETTER AN and GEORGIAN CAPITAL LETTER AN are false for equals and equals ignore case). But this is an edge case that we can ignore for now I believe.

I’ve tested it locally and all units tests are fine. So I’ll PR when I have time.

> CheersBruno
> 
>    On Sunday, 3 March 2019, 1:29:12 pm NZDT, Alex Herbert <al...@gmail.com> wrote:  
> 
> Having looked a bit more at StringUtils it appears that:
> 
> public static boolean equals(final CharSequence cs1, final CharSequence cs2);
> public static boolean equalsIgnoreCase(final CharSequence str1, final CharSequence str2);
> 
> share edge case logic checking but it is implemented differently (although with the same effect). They also have inconsistent parameter names (cs1 vs str1).
> 
> The equals method then ends up calling CharSequenceUtils.regionMatches but without the ignore case option. This method does a lot more work than necessary as it performs a lot of edge case checks then loops with regard to checking for case. It would be cleaner to mimic String.equals by using a stepwise charAt comparison.
> 
> Any objections to a PR to fix:
> 
> - The parameter name inconsistency
> - The edge case logic inconsistency
> - Change to a stepwise charAt comparison
> 
> Alex
> 
>> On 2 Mar 2019, at 20:49, Alex Herbert <al...@gmail.com> wrote:
>> 
>> 
>>> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com <ma...@syberion.com>> wrote:
>>> 
>>> Is your proposed method a stepwise charAt comparison across both, assuming
>>> non-null and equal length?
>> 
>> Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will do the job correctly (thanks Gary). It currently does all the edge case checks then calls a region matching method using the entire length but the effect is the same as:
>> 
>> for (int i = 0; i < cs1.length(); i++) {
>>     if (cs1.charAt(i) != cs2.charAt(i)) {
>>         return false;
>>     }
>> }
>> return true;
>> 
>> Switching in the above code instead of the call to regionMatches(…) at the end of StringUtils.equals(CharSequence, CharSequence) would avoid repeating all the edge case checks of length at the start of that method and the case insensitivity functionality. 
>> 
>> The StringUtils.equals method already detects if String is input as both arguments and defaults to that if possible. So this is basically for any other combination of CharSequence types where a simple stepwise charAt comparison is wanted.
>> 
>> 
>>> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
>>> where toString() on both and comparing isn't more expedient.
>> 
>> Just the memory overhead of duplicating to create a String. If a match is unlikely, especially near the start, then this is a cost to consider for longer strings.
>> 
>> I was just after something to put in place of the incorrect usage of:
>> 
>> CharSequence cs1, cs2;
>> cs1.equals(cs2);
>> 
>> Which is not part of the CharSequence interface and works only if inputting 2 objects that support equals correctly, like String or StringBuilder.
>> 
>> I’ve just has a look for .equals() in all of [text] and this is actually a bug that is in the newly submitted JaroWinklerSimilarity too. 
>> 
>> I’ll do a PR to fix that one.
>> 
>>> 
>>> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com <ma...@gmail.com>>
>>> wrote:
>>> 
>>>> I am helping with the PR for TEXT-126 to add to the similarity package.
>>>> 
>>>> Part of the new algorithm requires identifying if two CharSequences are
>>>> identical. Is there a utility in Text to do something like this:
>>>> 
>>>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>>>> 
>>>> I cannot find one with a quick regex search of the library. I am not
>>>> familiar with Lang either but this is a dependency so a method from there
>>>> could be used.
>>>> 
>>>> The current PR is using left.equals(right) on the input CharSequence to
>>>> compare to one to another which is wrong if the two input CharSequences do
>>>> not support matching, e.g. if the input was a String and StringBuilder then
>>>> String.equals(StringBuilder) would not match, even if the characters were
>>>> the same.
>>>> 
>>>> Regards,
>>>> 
>>>> Alex
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <ma...@commons.apache.org>
>>>> For additional commands, e-mail: dev-help@commons.apache.org <ma...@commons.apache.org>
>>>> 
>>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [Text, Lang] Matching two CharSequence instances

Posted by "Bruno P. Kinoshita" <ki...@apache.org>.

 Hi Alex,
Also found the two implementations similar, but confusing.
- The parameter name inconsistency
+1


- The edge case logic inconsistency

+1

- Change to a stepwise charAt comparison
+1 if the behaviour is kept. Right now some crazy characters like those old latin letters are supported (e.g. LATIN CAPITAL LIGATURE IJ), and the result for equal and equal ignore case are as expected.
And right now those methods do not work with surrogate characters (e.g. GEORGIAN LETTER AN and GEORGIAN CAPITAL LETTER AN are false for equals and equals ignore case). But this is an edge case that we can ignore for now I believe.
CheersBruno

    On Sunday, 3 March 2019, 1:29:12 pm NZDT, Alex Herbert <al...@gmail.com> wrote:  
 
 Having looked a bit more at StringUtils it appears that:

public static boolean equals(final CharSequence cs1, final CharSequence cs2);
public static boolean equalsIgnoreCase(final CharSequence str1, final CharSequence str2);

share edge case logic checking but it is implemented differently (although with the same effect). They also have inconsistent parameter names (cs1 vs str1).

The equals method then ends up calling CharSequenceUtils.regionMatches but without the ignore case option. This method does a lot more work than necessary as it performs a lot of edge case checks then loops with regard to checking for case. It would be cleaner to mimic String.equals by using a stepwise charAt comparison.

Any objections to a PR to fix:

- The parameter name inconsistency
- The edge case logic inconsistency
- Change to a stepwise charAt comparison

Alex

> On 2 Mar 2019, at 20:49, Alex Herbert <al...@gmail.com> wrote:
> 
> 
>> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com <ma...@syberion.com>> wrote:
>> 
>> Is your proposed method a stepwise charAt comparison across both, assuming
>> non-null and equal length?
> 
> Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will do the job correctly (thanks Gary). It currently does all the edge case checks then calls a region matching method using the entire length but the effect is the same as:
> 
> for (int i = 0; i < cs1.length(); i++) {
>    if (cs1.charAt(i) != cs2.charAt(i)) {
>        return false;
>    }
> }
> return true;
> 
> Switching in the above code instead of the call to regionMatches(…) at the end of StringUtils.equals(CharSequence, CharSequence) would avoid repeating all the edge case checks of length at the start of that method and the case insensitivity functionality. 
> 
> The StringUtils.equals method already detects if String is input as both arguments and defaults to that if possible. So this is basically for any other combination of CharSequence types where a simple stepwise charAt comparison is wanted.
> 
> 
>> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
>> where toString() on both and comparing isn't more expedient.
> 
> Just the memory overhead of duplicating to create a String. If a match is unlikely, especially near the start, then this is a cost to consider for longer strings.
> 
> I was just after something to put in place of the incorrect usage of:
> 
> CharSequence cs1, cs2;
> cs1.equals(cs2);
> 
> Which is not part of the CharSequence interface and works only if inputting 2 objects that support equals correctly, like String or StringBuilder.
> 
> I’ve just has a look for .equals() in all of [text] and this is actually a bug that is in the newly submitted JaroWinklerSimilarity too. 
> 
> I’ll do a PR to fix that one.
> 
>> 
>> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>>> I am helping with the PR for TEXT-126 to add to the similarity package.
>>> 
>>> Part of the new algorithm requires identifying if two CharSequences are
>>> identical. Is there a utility in Text to do something like this:
>>> 
>>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>>> 
>>> I cannot find one with a quick regex search of the library. I am not
>>> familiar with Lang either but this is a dependency so a method from there
>>> could be used.
>>> 
>>> The current PR is using left.equals(right) on the input CharSequence to
>>> compare to one to another which is wrong if the two input CharSequences do
>>> not support matching, e.g. if the input was a String and StringBuilder then
>>> String.equals(StringBuilder) would not match, even if the characters were
>>> the same.
>>> 
>>> Regards,
>>> 
>>> Alex
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <ma...@commons.apache.org>
>>> For additional commands, e-mail: dev-help@commons.apache.org <ma...@commons.apache.org>
>>> 
>>> 
>

Re: [Text, Lang] Matching two CharSequence instances

Posted by Alex Herbert <al...@gmail.com>.

Having looked a bit more at StringUtils it appears that:

public static boolean equals(final CharSequence cs1, final CharSequence cs2);
public static boolean equalsIgnoreCase(final CharSequence str1, final CharSequence str2);

share edge case logic checking but it is implemented differently (although with the same effect). They also have inconsistent parameter names (cs1 vs str1).

The equals method then ends up calling CharSequenceUtils.regionMatches but without the ignore case option. This method does a lot more work than necessary as it performs a lot of edge case checks then loops with regard to checking for case. It would be cleaner to mimic String.equals by using a stepwise charAt comparison.

Any objections to a PR to fix:

- The parameter name inconsistency
- The edge case logic inconsistency
- Change to a stepwise charAt comparison

Alex

> On 2 Mar 2019, at 20:49, Alex Herbert <al...@gmail.com> wrote:
> 
> 
>> On 2 Mar 2019, at 16:59, Mark Dacek <mark@syberion.com <ma...@syberion.com>> wrote:
>> 
>> Is your proposed method a stepwise charAt comparison across both, assuming
>> non-null and equal length?
> 
> Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will do the job correctly (thanks Gary). It currently does all the edge case checks then calls a region matching method using the entire length but the effect is the same as:
> 
> for (int i = 0; i < cs1.length(); i++) {
>     if (cs1.charAt(i) != cs2.charAt(i)) {
>         return false;
>     }
> }
> return true;
> 
> Switching in the above code instead of the call to regionMatches(…) at the end of StringUtils.equals(CharSequence, CharSequence) would avoid repeating all the edge case checks of length at the start of that method and the case insensitivity functionality. 
> 
> The StringUtils.equals method already detects if String is input as both arguments and defaults to that if possible. So this is basically for any other combination of CharSequence types where a simple stepwise charAt comparison is wanted.
> 
> 
>> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
>> where toString() on both and comparing isn't more expedient.
> 
> Just the memory overhead of duplicating to create a String. If a match is unlikely, especially near the start, then this is a cost to consider for longer strings.
> 
> I was just after something to put in place of the incorrect usage of:
> 
> CharSequence cs1, cs2;
> cs1.equals(cs2);
> 
> Which is not part of the CharSequence interface and works only if inputting 2 objects that support equals correctly, like String or StringBuilder.
> 
> I’ve just has a look for .equals() in all of [text] and this is actually a bug that is in the newly submitted JaroWinklerSimilarity too. 
> 
> I’ll do a PR to fix that one.
> 
>> 
>> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <alex.d.herbert@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>>> I am helping with the PR for TEXT-126 to add to the similarity package.
>>> 
>>> Part of the new algorithm requires identifying if two CharSequences are
>>> identical. Is there a utility in Text to do something like this:
>>> 
>>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>>> 
>>> I cannot find one with a quick regex search of the library. I am not
>>> familiar with Lang either but this is a dependency so a method from there
>>> could be used.
>>> 
>>> The current PR is using left.equals(right) on the input CharSequence to
>>> compare to one to another which is wrong if the two input CharSequences do
>>> not support matching, e.g. if the input was a String and StringBuilder then
>>> String.equals(StringBuilder) would not match, even if the characters were
>>> the same.
>>> 
>>> Regards,
>>> 
>>> Alex
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <ma...@commons.apache.org>
>>> For additional commands, e-mail: dev-help@commons.apache.org <ma...@commons.apache.org>
>>> 
>>> 
>

Re: [Text, Lang] Matching two CharSequence instances

Posted by Alex Herbert <al...@gmail.com>.

> On 2 Mar 2019, at 16:59, Mark Dacek <ma...@syberion.com> wrote:
> 
> Is your proposed method a stepwise charAt comparison across both, assuming
> non-null and equal length?

Yes. Although the StringUtils.equals(CharSequence, CharSequence) from [lang] will do the job correctly (thanks Gary). It currently does all the edge case checks then calls a region matching method using the entire length but the effect is the same as:

for (int i = 0; i < cs1.length(); i++) {
    if (cs1.charAt(i) != cs2.charAt(i)) {
        return false;
    }
}
return true;

Switching in the above code instead of the call to regionMatches(…) at the end of StringUtils.equals(CharSequence, CharSequence) would avoid repeating all the edge case checks of length at the start of that method and the case insensitivity functionality. 

The StringUtils.equals method already detects if String is input as both arguments and defaults to that if possible. So this is basically for any other combination of CharSequence types where a simple stepwise charAt comparison is wanted.

> Doesn't seem like a bad idea, though I'm curious whether there's a use-case
> where toString() on both and comparing isn't more expedient.

Just the memory overhead of duplicating to create a String. If a match is unlikely, especially near the start, then this is a cost to consider for longer strings.

I was just after something to put in place of the incorrect usage of:

CharSequence cs1, cs2;
cs1.equals(cs2);

Which is not part of the CharSequence interface and works only if inputting 2 objects that support equals correctly, like String or StringBuilder.

I’ve just has a look for .equals() in all of [text] and this is actually a bug that is in the newly submitted JaroWinklerSimilarity too. 

I’ll do a PR to fix that one.

> 
> On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <al...@gmail.com>
> wrote:
> 
>> I am helping with the PR for TEXT-126 to add to the similarity package.
>> 
>> Part of the new algorithm requires identifying if two CharSequences are
>> identical. Is there a utility in Text to do something like this:
>> 
>> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>> 
>> I cannot find one with a quick regex search of the library. I am not
>> familiar with Lang either but this is a dependency so a method from there
>> could be used.
>> 
>> The current PR is using left.equals(right) on the input CharSequence to
>> compare to one to another which is wrong if the two input CharSequences do
>> not support matching, e.g. if the input was a String and StringBuilder then
>> String.equals(StringBuilder) would not match, even if the characters were
>> the same.
>> 
>> Regards,
>> 
>> Alex
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>> 
>>

Re: [Text, Lang] Matching two CharSequence instances

Posted by Mark Dacek <ma...@syberion.com>.

Is your proposed method a stepwise charAt comparison across both, assuming
non-null and equal length?
Doesn't seem like a bad idea, though I'm curious whether there's a use-case
where toString() on both and comparing isn't more expedient.

On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <al...@gmail.com>
wrote:

> I am helping with the PR for TEXT-126 to add to the similarity package.
>
> Part of the new algorithm requires identifying if two CharSequences are
> identical. Is there a utility in Text to do something like this:
>
> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>
> I cannot find one with a quick regex search of the library. I am not
> familiar with Lang either but this is a dependency so a method from there
> could be used.
>
> The current PR is using left.equals(right) on the input CharSequence to
> compare to one to another which is wrong if the two input CharSequences do
> not support matching, e.g. if the input was a String and StringBuilder then
> String.equals(StringBuilder) would not match, even if the characters were
> the same.
>
> Regards,
>
> Alex
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [Text, Lang] Matching two CharSequence instances

Posted by Gary Gregory <ga...@gmail.com>.

See org.apache.commons.lang3.StringUtils.equals(CharSequence, CharSequence)

Gary

On Sat, Mar 2, 2019 at 11:53 AM Alex Herbert <al...@gmail.com>
wrote:

> I am helping with the PR for TEXT-126 to add to the similarity package.
>
> Part of the new algorithm requires identifying if two CharSequences are
> identical. Is there a utility in Text to do something like this:
>
> public static boolean CharSequenceUtils.equals(CharSequence, CharSequence);
>
> I cannot find one with a quick regex search of the library. I am not
> familiar with Lang either but this is a dependency so a method from there
> could be used.
>
> The current PR is using left.equals(right) on the input CharSequence to
> compare to one to another which is wrong if the two input CharSequences do
> not support matching, e.g. if the input was a String and StringBuilder then
> String.equals(StringBuilder) would not match, even if the characters were
> the same.
>
> Regards,
>
> Alex
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>