Posted to dev@commons.apache.org by Alex Herbert <al...@gmail.com> on 2019/03/07 15:49:41 UTC

[Text] JaccardSimilarity

A quick question about the JaccardSimilarity class:

Q. Why does it round the similarity to 2 decimal places?

This is not documented.

It is also done in the complementary JaccardDistance class.

Looking at the history in git it seems to have always been that way. 
First commit was 2016-11-27.

Thanks,

Alex



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Text] JaccardSimilarity

Posted by Alex Herbert <al...@gmail.com>.

> On 8 Mar 2019, at 00:01, Bruno P. Kinoshita <br...@yahoo.com.br.INVALID> wrote:
> 
>> I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is noted if someone upgrades. They can always restore functionality to as-it-was by doing a round on the output of the class. 
> +1
>> I’ve already made the test using the python distance.jaccard function from the distance library in the PR for Text-155. So changing the test is simple. It’s just the decision on whether to do it.
> I think we can aim at implementing this for 1.7 (which from the looks of it will have several bug fixes & improvements!).
> Cheers,
> Bruno

I'll put the changes into a Jira and PR.

Alex






Re: [Text] JaccardSimilarity

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.
> I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is noted if someone upgrades. They can always restore functionality to as-it-was by doing a round on the output of the class.
+1

> I’ve already made the test using the python distance.jaccard function from the distance library in the PR for Text-155. So changing the test is simple. It’s just the decision on whether to do it.
I think we can aim at implementing this for 1.7 (which from the looks of it will have several bug fixes & improvements!).

Cheers,
Bruno



  

Re: [Text] JaccardSimilarity

Posted by Alex Herbert <al...@gmail.com>.
Hi Bruno,

> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <ki...@apache.org> wrote:
> 
> Hi Alex,
> Can't recall why it was done that way. When the initial code for the edit distances was created, some Java libraries like Simmetrics, java-string-similarity, Lucene, and also R/Python code were used to verify the output of the edit distances.
> Maybe we used Math.round just to get a test passing, which I agree should have been documented.
> But it would be even better if we just dropped the Math.round and instead updated the tests to use the assertEquals(expected, actual, threshold) method, with a good enough threshold.
> What do you think?

I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is noted if someone upgrades. They can always restore functionality to as-it-was by doing a round on the output of the class. 
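
As a rough sketch of that workaround, a caller could wrap the similarity function and round its output. The names below (RoundedSimilarity, roundTo2dp, withRounding) are hypothetical, and the Math.round(x * 100d) / 100d form is only an assumption about how the rounding is currently done. The BiFunction stands in for the similarity class, which a caller might pass as a method reference.

```java
import java.util.function.BiFunction;

public class RoundedSimilarity {

    /** Rounds to 2 decimal places; assumed to mirror the behaviour being dropped. */
    public static double roundTo2dp(double value) {
        return Math.round(value * 100d) / 100d;
    }

    /** Wraps any similarity function so that the caller gets a rounded result. */
    public static BiFunction<CharSequence, CharSequence, Double> withRounding(
            BiFunction<CharSequence, CharSequence, Double> similarity) {
        return (left, right) -> roundTo2dp(similarity.apply(left, right));
    }

    public static void main(String[] args) {
        // Stand-in for an unrounded similarity; a real caller would pass the library instance.
        BiFunction<CharSequence, CharSequence, Double> unrounded = (l, r) -> 1 / 3d;
        System.out.println(withRounding(unrounded).apply("a", "b"));
    }
}
```

Keeping the rounding on the caller side means the class itself can return the exact ratio.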

If I understand the metric correctly (intersection over union), a difference in the 3rd decimal place would require the union of the two character sets to be above 200, i.e. a string containing over 200 unique characters, e.g.

A) 0/200 = 0
B) 1/200 = 0.005
C) 2/200 = 0.01

In this case results A and C can be distinguished, but B and C cannot, because B rounds up to 0.01.

So in practical terms it would not make a difference unless using a large character set. For ASCII strings there is no difference.
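
The A/B/C example above can be checked with a short sketch. It assumes the rounding is Math.round(value * 100d) / 100d, which is a guess at the form rather than something confirmed in this thread, and the class and method names are hypothetical.

```java
public class RoundingEffect {

    /** Assumed form of the rounding: Math.round to the nearest 0.01. */
    public static double roundTo2dp(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        int union = 200; // size of the union of the two character sets
        for (int intersection = 0; intersection <= 2; intersection++) {
            double exact = intersection / (double) union;
            // Print the exact ratio next to its rounded form
            System.out.println(intersection + "/" + union + " = " + exact
                    + " -> " + roundTo2dp(exact));
        }
    }
}
```

Under that assumption, B and C both come out as 0.01 after rounding, while the unrounded values stay distinct.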

I’ve already made the test using the python distance.jaccard function from the distance library in the PR for Text-155. So changing the test is simple. It’s just the decision on whether to do it.

Alex





Re: [Text] JaccardSimilarity

Posted by "Bruno P. Kinoshita" <ki...@apache.org>.
Hi Alex,
Can't recall why it was done that way. When the initial code for the edit distances was created, some Java libraries like Simmetrics, java-string-similarity, Lucene, and also R/Python code were used to verify the output of the edit distances.
Maybe we used Math.round just to get a test passing, which I agree should have been documented.
But it would be even better if we just dropped the Math.round and instead updated the tests to use the assertEquals(expected, actual, threshold) method, with a good enough threshold.
What do you think?
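
A minimal sketch of what such a test could look like, written without JUnit so that it stands alone. The jaccard method here is a simplified stand-in over sets of characters, not the library code, and all names (JaccardToleranceSketch, jaccard, equalsWithin) are hypothetical. Returning 0 for two empty inputs is an arbitrary choice for the sketch.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardToleranceSketch {

    /** Unrounded Jaccard similarity over character sets: |intersection| / |union|. */
    public static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> a = toSet(left);
        Set<Character> b = toSet(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0d; // arbitrary choice for two empty inputs in this sketch
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return intersection.size() / (double) union.size();
    }

    private static Set<Character> toSet(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    /** Same idea as assertEquals(expected, actual, threshold), returned as a boolean. */
    public static boolean equalsWithin(double expected, double actual, double threshold) {
        return Math.abs(expected - actual) <= threshold;
    }

    public static void main(String[] args) {
        // "night" and "nacht" share 3 characters (n, h, t) out of 7 distinct ones
        double actual = jaccard("night", "nacht");
        System.out.println(equalsWithin(3 / 7d, actual, 1e-12));
    }
}
```

With a tolerance-based comparison the expected values can be written as exact ratios, so the tests no longer depend on any rounding inside the class.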
Cheers,
Bruno

