You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Benedikt Ritter <br...@apache.org> on 2015/04/19 11:01:34 UTC

[TEXT] Review of the code base

Hi all,

I've looked through the code base of [text] and I already did some clean
up. However there are a few things, that need more work IMHO:

1. IP clearance! There are several comments talking about code adapted from
PHP libraries. This needs to be listed in NOTICE.txt. For an example see
the NOTICE.txt in [lang].

2. The names package needs work. Currently the HumanNameParser takes a Name
object (which is just a wrapper around a String) as parameter. The parse
result is then saved in fields (e.g. firstName). This makes the parser
unusable after it has parsed a name. I think the API should be reworked
such as:
- The constructor of the parser takes configuration options which can be
reused for several names to parse
- the parse method takes a string as parameter, containing a name
- the parse method returns an immutable Name objects which has getters for
firstName, lastName etc.

3. There are a some useless JavaDocs, which can be dropped. For example:
/**
 * Middle name.
 */
private String middle;

Better rename the variable to middleName and drop the comment.

4. Several classes have no tests. The overall test coverage looks good, so
this may not be a problem.

5. I don't think the shade configuration is correct. To me it looks like
commons-text depends on [lang] and also shades some classes. Do we really
need the explicit dependency? Furthermore, why do we need to shade so many
classes? Why do we need anything beside StringUtils?

Nevertheless, this library looks very good. Kudos to Bruno for pushing this
forward!

Regards,
Benedikt






-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.
Great! Thanks for taking care of this Benedikt!
There are four open issues. 

TEXT-2: Add Jaccard Index and Jaccard DistanceTEXT-10: A more complex Levenshtein distanceTEXT-14: Create a generic class that calculates a distance based on a similarity scoreTEXT-16: Improve HumanNameParser
I've collected a list of features that I think could be a good contribution to [text], but I'll re-read and filter the list before creating new issues. Things like working with syllables, different ways to convert text to vectors (useful for NLP and text analytics), text alignment (left, right, justify), Burrows–Wheeler transform, etc.

But I think prior to a 1.0 release, TEXT-16 is the only pending issue. What do you think?
CheersBruno

 
      From: Benedikt Ritter <br...@apache.org>
 To: Commons Developers List <de...@commons.apache.org> 
 Sent: Sunday, April 26, 2015 10:14 PM
 Subject: Re: [TEXT] Review of the code base
   
2015-04-26 11:52 GMT+02:00 Benedikt Ritter <br...@apache.org>:

>
>
> 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>
>>
>>
>> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>>
>>> Hello Bruno,
>>>
>>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <
>>> brunodepaulak@yahoo.com.br>:
>>>
>>>> Thanks for the thorough code review Benedikt! Comments inline.
>>>>
>>>> > 1. IP clearance! There are several comments talking about code
>>>> adapted from
>>>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example
>>>> see
>>>> >the NOTICE.txt in [lang].
>>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>>>> And first try to fix it:
>>>>
>>>>
>>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>>>> Does that look correct? I also checked with the author before porting
>>>> it to the initial Java version, and told him about the [text] component -
>>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>>>
>>>
>>> I've adjusted the changes a bit and merged it into master.
>>>
>>>
>>>>
>>>> > 2. The names package needs work.
>>>> Completely agree. We can try to make it thread safe too. This will take
>>>> a little longer.
>>>> File new issue for tracking it
>>>> https://issues.apache.org/jira/browse/SANDBOX-498
>>>
>>>
>>> I've pushed a new branch containing a proposal how to fix this. Please
>>> review.
>>>
>>>
>>>>
>>>>
>>>> > 3. There are a some useless JavaDocs, which can be dropped. For
>>>> example:
>>>> > (...)> Better rename the variable to middleName and drop the comment.
>>>> What about renaming the variables but leaving the comments?
>>>>
>>>
>>> For I've corrected this for the names package in the SANDBOX-498 branch.
>>>
>>>
>>>>
>>>> > 4. Several classes have no tests. The overall test coverage looks
>>>> good, so
>>>> >this may not be a problem.
>>>> Before the initial release I can start a development cycle just to add
>>>> new tests for parts that seem important - like code with somewhat high [+5]
>>>> cyclomatic complexity and uncovered branches. Just to make sure that the
>>>> initial release will have an acceptable quality and less hidden bugs :-)
>>>>
>>>
>>> Nice!
>>>
>>>
>>>> > 5. I don't think the shade configuration is correct. To me it looks
>>>> like
>>>> >commons-text depends on [lang] and also shades some classes. Do we
>>>> really
>>>> >need the explicit dependency? Furthermore, why do we need to shade so
>>>> many
>>>> >classes? Why do we need anything beside StringUtils?
>>>> My mistake. I copied the [functor] configuration without much thinking.
>>>> IIRC, this dependency was added because of the names package. But it
>>>> shouldn't be too complicate to replace it. It is mainly used for checking
>>>> empty/blank values.
>>>>
>>>> So let's drop the [lang] dependency and use something like `$VAR !=
>>>> null && ! "".equals($VAR)` instead?
>>>>
>>>
>>> I think we also need StringUtils in the similarity package. We should
>>> handle this after we have merged the changes made to the names package.
>>>
>>>
>>>> Thanks again for reviewing the code!
>>>>
>>>
>>> No problem, thanks for pushing this forward.
>>>
>>> Here are some more things we need to take care of before 1.0:
>>> - JIRA project
>>>
>>
>> Requested a JIRA project:
>> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>>
>
> Here is the JIRA project [1]. I'll find out, how we can move the Commons
> Text issues from the Sandbox project to the new project.
>

I've migrated all issues from the Sandbox project to the new Commons Text
project. I've also updated the project documentation.


>
> Benedikt
>
> [1] https://issues.apache.org/jira/browse/TEXT
>
>
>>
>>
>>> - Rename similarity package to distance package? We have five Distance
>>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
>>> it a distance or a similarity? It should be renamed accordingly.
>>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
>>> Does it make sense to introduce a new class called Vector?
>>>
>>> Benedikt
>>>
>>>
>>>>
>>>> All the best,Bruno
>>>>
>>>>
>>>>      From: Benedikt Ritter <br...@apache.org>
>>>>  To: Commons Developers List <de...@commons.apache.org>
>>>>  Sent: Sunday, April 19, 2015 9:01 PM
>>>>  Subject: [TEXT] Review of the code base
>>>>
>>>> Hi all,
>>>>
>>>> I've looked through the code base of [text] and I already did some clean
>>>> up. However there are a few things, that need more work IMHO:
>>>>
>>>> 1. IP clearance! There are several comments talking about code adapted
>>>> from
>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>>> the NOTICE.txt in [lang].
>>>>
>>>> 2. The names package needs work. Currently the HumanNameParser takes a
>>>> Name
>>>> object (which is just a wrapper around a String) as parameter. The parse
>>>> result is then saved in fields (e.g. firstName). This makes the parser
>>>> unusable after it has parsed a name. I think the API should be reworked
>>>> such as:
>>>> - The constructor of the parser takes configuration options which can be
>>>> reused for several names to parse
>>>> - the parse method takes a string as parameter, containing a name
>>>> - the parse method returns an immutable Name objects which has getters
>>>> for
>>>> firstName, lastName etc.
>>>>
>>>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>>>> /**
>>>>  * Middle name.
>>>>  */
>>>> private String middle;
>>>>
>>>> Better rename the variable to middleName and drop the comment.
>>>>
>>>> 4. Several classes have no tests. The overall test coverage looks good,
>>>> so
>>>> this may not be a problem.
>>>>
>>>> 5. I don't think the shade configuration is correct. To me it looks like
>>>> commons-text depends on [lang] and also shades some classes. Do we
>>>> really
>>>> need the explicit dependency? Furthermore, why do we need to shade so
>>>> many
>>>> classes? Why do we need anything beside StringUtils?
>>>>
>>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>>>> this
>>>> forward!
>>>>
>>>> Regards,
>>>> Benedikt
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://people.apache.org/~britter/
>>>> http://www.systemoutprintln.de/
>>>> http://twitter.com/BenediktRitter
>>>> http://github.com/britter


>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> http://people.apache.org/~britter/
>>> http://www.systemoutprintln.de/
>>> http://twitter.com/BenediktRitter
>>> http://github.com/britter
>>>
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter


   

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
2015-04-27 8:10 GMT+02:00 Benedikt Ritter <br...@apache.org>:

> Hello Phil,
>
> 2015-04-27 4:56 GMT+02:00 Phil Steitz <ph...@gmail.com>:
>
>> On 4/26/15 3:14 AM, Benedikt Ritter wrote:
>> > 2015-04-26 11:52 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>> >
>> >>
>> >> 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>> >>
>> >>>
>> >>> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>> >>>
>> >>>> Hello Bruno,
>> >>>>
>> >>>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <
>> >>>> brunodepaulak@yahoo.com.br>:
>> >>>>
>> >>>>> Thanks for the thorough code review Benedikt! Comments inline.
>> >>>>>
>> >>>>>> 1. IP clearance! There are several comments talking about code
>> >>>>> adapted from
>> >>>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an
>> example
>> >>>>> see
>> >>>>>> the NOTICE.txt in [lang].
>> >>>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>> >>>>> And first try to fix it:
>> >>>>>
>> >>>>>
>> >>>>>
>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>> >>>>> Does that look correct? I also checked with the author before
>> porting
>> >>>>> it to the initial Java version, and told him about the [text]
>> component -
>> >>>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>> >>>>
>> >>>> I've adjusted the changes a bit and merged it into master.
>> >>>>
>> >>>>
>> >>>>>> 2. The names package needs work.
>> >>>>> Completely agree. We can try to make it thread safe too. This will
>> take
>> >>>>> a little longer.
>> >>>>> File new issue for tracking it
>> >>>>> https://issues.apache.org/jira/browse/SANDBOX-498
>> >>>>
>> >>>> I've pushed a new branch containing a proposal how to fix this.
>> Please
>> >>>> review.
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>> 3. There are a some useless JavaDocs, which can be dropped. For
>> >>>>> example:
>> >>>>>> (...)> Better rename the variable to middleName and drop the
>> comment.
>> >>>>> What about renaming the variables but leaving the comments?
>> >>>>>
>> >>>> For I've corrected this for the names package in the SANDBOX-498
>> branch.
>> >>>>
>> >>>>
>> >>>>>> 4. Several classes have no tests. The overall test coverage looks
>> >>>>> good, so
>> >>>>>> this may not be a problem.
>> >>>>> Before the initial release I can start a development cycle just to
>> add
>> >>>>> new tests for parts that seem important - like code with somewhat
>> high [+5]
>> >>>>> cyclomatic complexity and uncovered branches. Just to make sure
>> that the
>> >>>>> initial release will have an acceptable quality and less hidden
>> bugs :-)
>> >>>>>
>> >>>> Nice!
>> >>>>
>> >>>>
>> >>>>>> 5. I don't think the shade configuration is correct. To me it looks
>> >>>>> like
>> >>>>>> commons-text depends on [lang] and also shades some classes. Do we
>> >>>>> really
>> >>>>>> need the explicit dependency? Furthermore, why do we need to shade
>> so
>> >>>>> many
>> >>>>>> classes? Why do we need anything beside StringUtils?
>> >>>>> My mistake. I copied the [functor] configuration without much
>> thinking.
>> >>>>> IIRC, this dependency was added because of the names package. But it
>> >>>>> shouldn't be too complicate to replace it. It is mainly used for
>> checking
>> >>>>> empty/blank values.
>> >>>>>
>> >>>>> So let's drop the [lang] dependency and use something like `$VAR !=
>> >>>>> null && ! "".equals($VAR)` instead?
>> >>>>>
>> >>>> I think we also need StringUtils in the similarity package. We should
>> >>>> handle this after we have merged the changes made to the names
>> package.
>> >>>>
>> >>>>
>> >>>>> Thanks again for reviewing the code!
>> >>>>>
>> >>>> No problem, thanks for pushing this forward.
>> >>>>
>> >>>> Here are some more things we need to take care of before 1.0:
>> >>>> - JIRA project
>> >>>>
>> >>> Requested a JIRA project:
>> >>>
>> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>> >>>
>> >> Here is the JIRA project [1]. I'll find out, how we can move the
>> Commons
>> >> Text issues from the Sandbox project to the new project.
>> >>
>> > I've migrated all issues from the Sandbox project to the new Commons
>> Text
>> > project. I've also updated the project documentation.
>>
>> Did I miss the promotion VOTE?
>>
>
> No, but I think the component should be in a state to be released before
> starting a promotion vote. We have examples of components which have been
> promoted to early and I don't want this to happen with [text].
>

I hope we haven't stepped on anybody's toes or raised the impression we're
doing stuff without properly informing the dev ML about our plans. I will
write an informational e-mail together illustrating what we have
implemented, what our plans are and what the road map for promotion is.

Thank you!
Benedikt


>
> Benedikt
>
>
>>
>> Phil
>> >
>> >
>> >> Benedikt
>> >>
>> >> [1] https://issues.apache.org/jira/browse/TEXT
>> >>
>> >>
>> >>>
>> >>>> - Rename similarity package to distance package? We have five
>> Distance
>> >>>> implementations but only one Similarity. Furthermore we have
>> FuzzyScore. Is
>> >>>> it a distance or a similarity? It should be renamed accordingly.
>> >>>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence,
>> Integer>.
>> >>>> Does it make sense to introduce a new class called Vector?
>> >>>>
>> >>>> Benedikt
>> >>>>
>> >>>>
>> >>>>> All the best,Bruno
>> >>>>>
>> >>>>>
>> >>>>>       From: Benedikt Ritter <br...@apache.org>
>> >>>>>  To: Commons Developers List <de...@commons.apache.org>
>> >>>>>  Sent: Sunday, April 19, 2015 9:01 PM
>> >>>>>  Subject: [TEXT] Review of the code base
>> >>>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> I've looked through the code base of [text] and I already did some
>> clean
>> >>>>> up. However there are a few things, that need more work IMHO:
>> >>>>>
>> >>>>> 1. IP clearance! There are several comments talking about code
>> adapted
>> >>>>> from
>> >>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an
>> example see
>> >>>>> the NOTICE.txt in [lang].
>> >>>>>
>> >>>>> 2. The names package needs work. Currently the HumanNameParser
>> takes a
>> >>>>> Name
>> >>>>> object (which is just a wrapper around a String) as parameter. The
>> parse
>> >>>>> result is then saved in fields (e.g. firstName). This makes the
>> parser
>> >>>>> unusable after it has parsed a name. I think the API should be
>> reworked
>> >>>>> such as:
>> >>>>> - The constructor of the parser takes configuration options which
>> can be
>> >>>>> reused for several names to parse
>> >>>>> - the parse method takes a string as parameter, containing a name
>> >>>>> - the parse method returns an immutable Name objects which has
>> getters
>> >>>>> for
>> >>>>> firstName, lastName etc.
>> >>>>>
>> >>>>> 3. There are a some useless JavaDocs, which can be dropped. For
>> example:
>> >>>>> /**
>> >>>>>  * Middle name.
>> >>>>>  */
>> >>>>> private String middle;
>> >>>>>
>> >>>>> Better rename the variable to middleName and drop the comment.
>> >>>>>
>> >>>>> 4. Several classes have no tests. The overall test coverage looks
>> good,
>> >>>>> so
>> >>>>> this may not be a problem.
>> >>>>>
>> >>>>> 5. I don't think the shade configuration is correct. To me it looks
>> like
>> >>>>> commons-text depends on [lang] and also shades some classes. Do we
>> >>>>> really
>> >>>>> need the explicit dependency? Furthermore, why do we need to shade
>> so
>> >>>>> many
>> >>>>> classes? Why do we need anything beside StringUtils?
>> >>>>>
>> >>>>> Nevertheless, this library looks very good. Kudos to Bruno for
>> pushing
>> >>>>> this
>> >>>>> forward!
>> >>>>>
>> >>>>> Regards,
>> >>>>> Benedikt
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> http://people.apache.org/~britter/
>> >>>>> http://www.systemoutprintln.de/
>> >>>>> http://twitter.com/BenediktRitter
>> >>>>> http://github.com/britter
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> http://people.apache.org/~britter/
>> >>>> http://www.systemoutprintln.de/
>> >>>> http://twitter.com/BenediktRitter
>> >>>> http://github.com/britter
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> http://people.apache.org/~britter/
>> >>> http://www.systemoutprintln.de/
>> >>> http://twitter.com/BenediktRitter
>> >>> http://github.com/britter
>> >>>
>> >>
>> >>
>> >> --
>> >> http://people.apache.org/~britter/
>> >> http://www.systemoutprintln.de/
>> >> http://twitter.com/BenediktRitter
>> >> http://github.com/britter
>> >>
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
Hello Phil,

2015-04-27 4:56 GMT+02:00 Phil Steitz <ph...@gmail.com>:

> On 4/26/15 3:14 AM, Benedikt Ritter wrote:
> > 2015-04-26 11:52 GMT+02:00 Benedikt Ritter <br...@apache.org>:
> >
> >>
> >> 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:
> >>
> >>>
> >>> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
> >>>
> >>>> Hello Bruno,
> >>>>
> >>>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <
> >>>> brunodepaulak@yahoo.com.br>:
> >>>>
> >>>>> Thanks for the thorough code review Benedikt! Comments inline.
> >>>>>
> >>>>>> 1. IP clearance! There are several comments talking about code
> >>>>> adapted from
> >>>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example
> >>>>> see
> >>>>>> the NOTICE.txt in [lang].
> >>>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
> >>>>> And first try to fix it:
> >>>>>
> >>>>>
> >>>>>
> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
> >>>>> Does that look correct? I also checked with the author before porting
> >>>>> it to the initial Java version, and told him about the [text]
> component -
> >>>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
> >>>>
> >>>> I've adjusted the changes a bit and merged it into master.
> >>>>
> >>>>
> >>>>>> 2. The names package needs work.
> >>>>> Completely agree. We can try to make it thread safe too. This will
> take
> >>>>> a little longer.
> >>>>> File new issue for tracking it
> >>>>> https://issues.apache.org/jira/browse/SANDBOX-498
> >>>>
> >>>> I've pushed a new branch containing a proposal how to fix this. Please
> >>>> review.
> >>>>
> >>>>
> >>>>>
> >>>>>> 3. There are a some useless JavaDocs, which can be dropped. For
> >>>>> example:
> >>>>>> (...)> Better rename the variable to middleName and drop the
> comment.
> >>>>> What about renaming the variables but leaving the comments?
> >>>>>
> >>>> For I've corrected this for the names package in the SANDBOX-498
> branch.
> >>>>
> >>>>
> >>>>>> 4. Several classes have no tests. The overall test coverage looks
> >>>>> good, so
> >>>>>> this may not be a problem.
> >>>>> Before the initial release I can start a development cycle just to
> add
> >>>>> new tests for parts that seem important - like code with somewhat
> high [+5]
> >>>>> cyclomatic complexity and uncovered branches. Just to make sure that
> the
> >>>>> initial release will have an acceptable quality and less hidden bugs
> :-)
> >>>>>
> >>>> Nice!
> >>>>
> >>>>
> >>>>>> 5. I don't think the shade configuration is correct. To me it looks
> >>>>> like
> >>>>>> commons-text depends on [lang] and also shades some classes. Do we
> >>>>> really
> >>>>>> need the explicit dependency? Furthermore, why do we need to shade
> so
> >>>>> many
> >>>>>> classes? Why do we need anything beside StringUtils?
> >>>>> My mistake. I copied the [functor] configuration without much
> thinking.
> >>>>> IIRC, this dependency was added because of the names package. But it
> >>>>> shouldn't be too complicate to replace it. It is mainly used for
> checking
> >>>>> empty/blank values.
> >>>>>
> >>>>> So let's drop the [lang] dependency and use something like `$VAR !=
> >>>>> null && ! "".equals($VAR)` instead?
> >>>>>
> >>>> I think we also need StringUtils in the similarity package. We should
> >>>> handle this after we have merged the changes made to the names
> package.
> >>>>
> >>>>
> >>>>> Thanks again for reviewing the code!
> >>>>>
> >>>> No problem, thanks for pushing this forward.
> >>>>
> >>>> Here are some more things we need to take care of before 1.0:
> >>>> - JIRA project
> >>>>
> >>> Requested a JIRA project:
> >>>
> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
> >>>
> >> Here is the JIRA project [1]. I'll find out, how we can move the Commons
> >> Text issues from the Sandbox project to the new project.
> >>
> > I've migrated all issues from the Sandbox project to the new Commons Text
> > project. I've also updated the project documentation.
>
> Did I miss the promotion VOTE?
>

No, but I think the component should be in a state to be released before
starting a promotion vote. We have examples of components which have been
promoted to early and I don't want this to happen with [text].

Benedikt


>
> Phil
> >
> >
> >> Benedikt
> >>
> >> [1] https://issues.apache.org/jira/browse/TEXT
> >>
> >>
> >>>
> >>>> - Rename similarity package to distance package? We have five Distance
> >>>> implementations but only one Similarity. Furthermore we have
> FuzzyScore. Is
> >>>> it a distance or a similarity? It should be renamed accordingly.
> >>>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
> >>>> Does it make sense to introduce a new class called Vector?
> >>>>
> >>>> Benedikt
> >>>>
> >>>>
> >>>>> All the best,Bruno
> >>>>>
> >>>>>
> >>>>>       From: Benedikt Ritter <br...@apache.org>
> >>>>>  To: Commons Developers List <de...@commons.apache.org>
> >>>>>  Sent: Sunday, April 19, 2015 9:01 PM
> >>>>>  Subject: [TEXT] Review of the code base
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I've looked through the code base of [text] and I already did some
> clean
> >>>>> up. However there are a few things, that need more work IMHO:
> >>>>>
> >>>>> 1. IP clearance! There are several comments talking about code
> adapted
> >>>>> from
> >>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example
> see
> >>>>> the NOTICE.txt in [lang].
> >>>>>
> >>>>> 2. The names package needs work. Currently the HumanNameParser takes
> a
> >>>>> Name
> >>>>> object (which is just a wrapper around a String) as parameter. The
> parse
> >>>>> result is then saved in fields (e.g. firstName). This makes the
> parser
> >>>>> unusable after it has parsed a name. I think the API should be
> reworked
> >>>>> such as:
> >>>>> - The constructor of the parser takes configuration options which
> can be
> >>>>> reused for several names to parse
> >>>>> - the parse method takes a string as parameter, containing a name
> >>>>> - the parse method returns an immutable Name objects which has
> getters
> >>>>> for
> >>>>> firstName, lastName etc.
> >>>>>
> >>>>> 3. There are a some useless JavaDocs, which can be dropped. For
> example:
> >>>>> /**
> >>>>>  * Middle name.
> >>>>>  */
> >>>>> private String middle;
> >>>>>
> >>>>> Better rename the variable to middleName and drop the comment.
> >>>>>
> >>>>> 4. Several classes have no tests. The overall test coverage looks
> good,
> >>>>> so
> >>>>> this may not be a problem.
> >>>>>
> >>>>> 5. I don't think the shade configuration is correct. To me it looks
> like
> >>>>> commons-text depends on [lang] and also shades some classes. Do we
> >>>>> really
> >>>>> need the explicit dependency? Furthermore, why do we need to shade so
> >>>>> many
> >>>>> classes? Why do we need anything beside StringUtils?
> >>>>>
> >>>>> Nevertheless, this library looks very good. Kudos to Bruno for
> pushing
> >>>>> this
> >>>>> forward!
> >>>>>
> >>>>> Regards,
> >>>>> Benedikt
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> http://people.apache.org/~britter/
> >>>>> http://www.systemoutprintln.de/
> >>>>> http://twitter.com/BenediktRitter
> >>>>> http://github.com/britter
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://people.apache.org/~britter/
> >>>> http://www.systemoutprintln.de/
> >>>> http://twitter.com/BenediktRitter
> >>>> http://github.com/britter
> >>>>
> >>>
> >>>
> >>> --
> >>> http://people.apache.org/~britter/
> >>> http://www.systemoutprintln.de/
> >>> http://twitter.com/BenediktRitter
> >>> http://github.com/britter
> >>>
> >>
> >>
> >> --
> >> http://people.apache.org/~britter/
> >> http://www.systemoutprintln.de/
> >> http://twitter.com/BenediktRitter
> >> http://github.com/britter
> >>
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>


-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Phil Steitz <ph...@gmail.com>.
On 4/26/15 3:14 AM, Benedikt Ritter wrote:
> 2015-04-26 11:52 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>
>>
>> 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>>
>>>
>>> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>>>
>>>> Hello Bruno,
>>>>
>>>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <
>>>> brunodepaulak@yahoo.com.br>:
>>>>
>>>>> Thanks for the thorough code review Benedikt! Comments inline.
>>>>>
>>>>>> 1. IP clearance! There are several comments talking about code
>>>>> adapted from
>>>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example
>>>>> see
>>>>>> the NOTICE.txt in [lang].
>>>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>>>>> And first try to fix it:
>>>>>
>>>>>
>>>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>>>>> Does that look correct? I also checked with the author before porting
>>>>> it to the initial Java version, and told him about the [text] component -
>>>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>>>>
>>>> I've adjusted the changes a bit and merged it into master.
>>>>
>>>>
>>>>>> 2. The names package needs work.
>>>>> Completely agree. We can try to make it thread safe too. This will take
>>>>> a little longer.
>>>>> File new issue for tracking it
>>>>> https://issues.apache.org/jira/browse/SANDBOX-498
>>>>
>>>> I've pushed a new branch containing a proposal how to fix this. Please
>>>> review.
>>>>
>>>>
>>>>>
>>>>>> 3. There are a some useless JavaDocs, which can be dropped. For
>>>>> example:
>>>>>> (...)> Better rename the variable to middleName and drop the comment.
>>>>> What about renaming the variables but leaving the comments?
>>>>>
>>>> For I've corrected this for the names package in the SANDBOX-498 branch.
>>>>
>>>>
>>>>>> 4. Several classes have no tests. The overall test coverage looks
>>>>> good, so
>>>>>> this may not be a problem.
>>>>> Before the initial release I can start a development cycle just to add
>>>>> new tests for parts that seem important - like code with somewhat high [+5]
>>>>> cyclomatic complexity and uncovered branches. Just to make sure that the
>>>>> initial release will have an acceptable quality and less hidden bugs :-)
>>>>>
>>>> Nice!
>>>>
>>>>
>>>>>> 5. I don't think the shade configuration is correct. To me it looks
>>>>> like
>>>>>> commons-text depends on [lang] and also shades some classes. Do we
>>>>> really
>>>>>> need the explicit dependency? Furthermore, why do we need to shade so
>>>>> many
>>>>>> classes? Why do we need anything beside StringUtils?
>>>>> My mistake. I copied the [functor] configuration without much thinking.
>>>>> IIRC, this dependency was added because of the names package. But it
>>>>> shouldn't be too complicate to replace it. It is mainly used for checking
>>>>> empty/blank values.
>>>>>
>>>>> So let's drop the [lang] dependency and use something like `$VAR !=
>>>>> null && ! "".equals($VAR)` instead?
>>>>>
>>>> I think we also need StringUtils in the similarity package. We should
>>>> handle this after we have merged the changes made to the names package.
>>>>
>>>>
>>>>> Thanks again for reviewing the code!
>>>>>
>>>> No problem, thanks for pushing this forward.
>>>>
>>>> Here are some more things we need to take care of before 1.0:
>>>> - JIRA project
>>>>
>>> Requested a JIRA project:
>>> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>>>
>> Here is the JIRA project [1]. I'll find out, how we can move the Commons
>> Text issues from the Sandbox project to the new project.
>>
> I've migrated all issues from the Sandbox project to the new Commons Text
> project. I've also updated the project documentation.

Did I miss the promotion VOTE?

Phil
>
>
>> Benedikt
>>
>> [1] https://issues.apache.org/jira/browse/TEXT
>>
>>
>>>
>>>> - Rename similarity package to distance package? We have five Distance
>>>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
>>>> it a distance or a similarity? It should be renamed accordingly.
>>>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
>>>> Does it make sense to introduce a new class called Vector?
>>>>
>>>> Benedikt
>>>>
>>>>
>>>>> All the best,Bruno
>>>>>
>>>>>
>>>>>       From: Benedikt Ritter <br...@apache.org>
>>>>>  To: Commons Developers List <de...@commons.apache.org>
>>>>>  Sent: Sunday, April 19, 2015 9:01 PM
>>>>>  Subject: [TEXT] Review of the code base
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've looked through the code base of [text] and I already did some clean
>>>>> up. However there are a few things, that need more work IMHO:
>>>>>
>>>>> 1. IP clearance! There are several comments talking about code adapted
>>>>> from
>>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>>>> the NOTICE.txt in [lang].
>>>>>
>>>>> 2. The names package needs work. Currently the HumanNameParser takes a
>>>>> Name
>>>>> object (which is just a wrapper around a String) as parameter. The parse
>>>>> result is then saved in fields (e.g. firstName). This makes the parser
>>>>> unusable after it has parsed a name. I think the API should be reworked
>>>>> such as:
>>>>> - The constructor of the parser takes configuration options which can be
>>>>> reused for several names to parse
>>>>> - the parse method takes a string as parameter, containing a name
>>>>> - the parse method returns an immutable Name objects which has getters
>>>>> for
>>>>> firstName, lastName etc.
>>>>>
>>>>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>>>>> /**
>>>>>  * Middle name.
>>>>>  */
>>>>> private String middle;
>>>>>
>>>>> Better rename the variable to middleName and drop the comment.
>>>>>
>>>>> 4. Several classes have no tests. The overall test coverage looks good,
>>>>> so
>>>>> this may not be a problem.
>>>>>
>>>>> 5. I don't think the shade configuration is correct. To me it looks like
>>>>> commons-text depends on [lang] and also shades some classes. Do we
>>>>> really
>>>>> need the explicit dependency? Furthermore, why do we need to shade so
>>>>> many
>>>>> classes? Why do we need anything beside StringUtils?
>>>>>
>>>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>>>>> this
>>>>> forward!
>>>>>
>>>>> Regards,
>>>>> Benedikt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://people.apache.org/~britter/
>>>>> http://www.systemoutprintln.de/
>>>>> http://twitter.com/BenediktRitter
>>>>> http://github.com/britter
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> http://people.apache.org/~britter/
>>>> http://www.systemoutprintln.de/
>>>> http://twitter.com/BenediktRitter
>>>> http://github.com/britter
>>>>
>>>
>>>
>>> --
>>> http://people.apache.org/~britter/
>>> http://www.systemoutprintln.de/
>>> http://twitter.com/BenediktRitter
>>> http://github.com/britter
>>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
2015-04-26 11:52 GMT+02:00 Benedikt Ritter <br...@apache.org>:

>
>
> 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>
>>
>>
>> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>>
>>> Hello Bruno,
>>>
>>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <
>>> brunodepaulak@yahoo.com.br>:
>>>
>>>> Thanks for the thorough code review Benedikt! Comments inline.
>>>>
>>>> > 1. IP clearance! There are several comments talking about code
>>>> adapted from
>>>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example
>>>> see
>>>> >the NOTICE.txt in [lang].
>>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>>>> And first try to fix it:
>>>>
>>>>
>>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>>>> Does that look correct? I also checked with the author before porting
>>>> it to the initial Java version, and told him about the [text] component -
>>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>>>
>>>
>>> I've adjusted the changes a bit and merged it into master.
>>>
>>>
>>>>
>>>> > 2. The names package needs work.
>>>> Completely agree. We can try to make it thread safe too. This will take
>>>> a little longer.
>>>> File new issue for tracking it
>>>> https://issues.apache.org/jira/browse/SANDBOX-498
>>>
>>>
>>> I've pushed a new branch containing a proposal how to fix this. Please
>>> review.
>>>
>>>
>>>>
>>>>
>>>> > 3. There are a some useless JavaDocs, which can be dropped. For
>>>> example:
>>>> > (...)> Better rename the variable to middleName and drop the comment.
>>>> What about renaming the variables but leaving the comments?
>>>>
>>>
>>> For I've corrected this for the names package in the SANDBOX-498 branch.
>>>
>>>
>>>>
>>>> > 4. Several classes have no tests. The overall test coverage looks
>>>> good, so
>>>> >this may not be a problem.
>>>> Before the initial release I can start a development cycle just to add
>>>> new tests for parts that seem important - like code with somewhat high [+5]
>>>> cyclomatic complexity and uncovered branches. Just to make sure that the
>>>> initial release will have an acceptable quality and less hidden bugs :-)
>>>>
>>>
>>> Nice!
>>>
>>>
>>>> > 5. I don't think the shade configuration is correct. To me it looks
>>>> like
>>>> >commons-text depends on [lang] and also shades some classes. Do we
>>>> really
>>>> >need the explicit dependency? Furthermore, why do we need to shade so
>>>> many
>>>> >classes? Why do we need anything beside StringUtils?
>>>> My mistake. I copied the [functor] configuration without much thinking.
>>>> IIRC, this dependency was added because of the names package. But it
>>>> shouldn't be too complicate to replace it. It is mainly used for checking
>>>> empty/blank values.
>>>>
>>>> So let's drop the [lang] dependency and use something like `$VAR !=
>>>> null && ! "".equals($VAR)` instead?
>>>>
>>>
>>> I think we also need StringUtils in the similarity package. We should
>>> handle this after we have merged the changes made to the names package.
>>>
>>>
>>>> Thanks again for reviewing the code!
>>>>
>>>
>>> No problem, thanks for pushing this forward.
>>>
>>> Here are some more things we need to take care of before 1.0:
>>> - JIRA project
>>>
>>
>> Requested a JIRA project:
>> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>>
>
> Here is the JIRA project [1]. I'll find out, how we can move the Commons
> Text issues from the Sandbox project to the new project.
>

I've migrated all issues from the Sandbox project to the new Commons Text
project. I've also updated the project documentation.


>
> Benedikt
>
> [1] https://issues.apache.org/jira/browse/TEXT
>
>
>>
>>
>>> - Rename similarity package to distance package? We have five Distance
>>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
>>> it a distance or a similarity? It should be renamed accordingly.
>>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
>>> Does it make sense to introduce a new class called Vector?
>>>
>>> Benedikt
>>>
>>>
>>>>
>>>> All the best,Bruno
>>>>
>>>>
>>>>       From: Benedikt Ritter <br...@apache.org>
>>>>  To: Commons Developers List <de...@commons.apache.org>
>>>>  Sent: Sunday, April 19, 2015 9:01 PM
>>>>  Subject: [TEXT] Review of the code base
>>>>
>>>> Hi all,
>>>>
>>>> I've looked through the code base of [text] and I already did some clean
>>>> up. However there are a few things, that need more work IMHO:
>>>>
>>>> 1. IP clearance! There are several comments talking about code adapted
>>>> from
>>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>>> the NOTICE.txt in [lang].
>>>>
>>>> 2. The names package needs work. Currently the HumanNameParser takes a
>>>> Name
>>>> object (which is just a wrapper around a String) as parameter. The parse
>>>> result is then saved in fields (e.g. firstName). This makes the parser
>>>> unusable after it has parsed a name. I think the API should be reworked
>>>> such as:
>>>> - The constructor of the parser takes configuration options which can be
>>>> reused for several names to parse
>>>> - the parse method takes a string as parameter, containing a name
>>>> - the parse method returns an immutable Name objects which has getters
>>>> for
>>>> firstName, lastName etc.
>>>>
>>>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>>>> /**
>>>>  * Middle name.
>>>>  */
>>>> private String middle;
>>>>
>>>> Better rename the variable to middleName and drop the comment.
>>>>
>>>> 4. Several classes have no tests. The overall test coverage looks good,
>>>> so
>>>> this may not be a problem.
>>>>
>>>> 5. I don't think the shade configuration is correct. To me it looks like
>>>> commons-text depends on [lang] and also shades some classes. Do we
>>>> really
>>>> need the explicit dependency? Furthermore, why do we need to shade so
>>>> many
>>>> classes? Why do we need anything beside StringUtils?
>>>>
>>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>>>> this
>>>> forward!
>>>>
>>>> Regards,
>>>> Benedikt
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://people.apache.org/~britter/
>>>> http://www.systemoutprintln.de/
>>>> http://twitter.com/BenediktRitter
>>>> http://github.com/britter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> http://people.apache.org/~britter/
>>> http://www.systemoutprintln.de/
>>> http://twitter.com/BenediktRitter
>>> http://github.com/britter
>>>
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:

>
>
> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>
>> Hello Bruno,
>>
>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br
>> >:
>>
>>> Thanks for the thorough code review Benedikt! Comments inline.
>>>
>>> > 1. IP clearance! There are several comments talking about code adapted
>>> from
>>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>> >the NOTICE.txt in [lang].
>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>>> And first try to fix it:
>>>
>>>
>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>>> Does that look correct? I also checked with the author before porting it
>>> to the initial Java version, and told him about the [text] component -
>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>>
>>
>> I've adjusted the changes a bit and merged it into master.
>>
>>
>>>
>>> > 2. The names package needs work.
>>> Completely agree. We can try to make it thread safe too. This will take
>>> a little longer.
>>> File new issue for tracking it
>>> https://issues.apache.org/jira/browse/SANDBOX-498
>>
>>
>> I've pushed a new branch containing a proposal how to fix this. Please
>> review.
>>
>>
>>>
>>>
>>> > 3. There are a some useless JavaDocs, which can be dropped. For
>>> example:
>>> > (...)> Better rename the variable to middleName and drop the comment.
>>> What about renaming the variables but leaving the comments?
>>>
>>
>> For I've corrected this for the names package in the SANDBOX-498 branch.
>>
>>
>>>
>>> > 4. Several classes have no tests. The overall test coverage looks
>>> good, so
>>> >this may not be a problem.
>>> Before the initial release I can start a development cycle just to add
>>> new tests for parts that seem important - like code with somewhat high [+5]
>>> cyclomatic complexity and uncovered branches. Just to make sure that the
>>> initial release will have an acceptable quality and less hidden bugs :-)
>>>
>>
>> Nice!
>>
>>
>>> > 5. I don't think the shade configuration is correct. To me it looks
>>> like
>>> >commons-text depends on [lang] and also shades some classes. Do we
>>> really
>>> >need the explicit dependency? Furthermore, why do we need to shade so
>>> many
>>> >classes? Why do we need anything beside StringUtils?
>>> My mistake. I copied the [functor] configuration without much thinking.
>>> IIRC, this dependency was added because of the names package. But it
>>> shouldn't be too complicate to replace it. It is mainly used for checking
>>> empty/blank values.
>>>
>>> So let's drop the [lang] dependency and use something like `$VAR != null
>>> && ! "".equals($VAR)` instead?
>>>
>>
>> I think we also need StringUtils in the similarity package. We should
>> handle this after we have merged the changes made to the names package.
>>
>>
>>> Thanks again for reviewing the code!
>>>
>>
>> No problem, thanks for pushing this forward.
>>
>> Here are some more things we need to take care of before 1.0:
>> - JIRA project
>>
>
> Requested a JIRA project:
> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>

Here is the JIRA project [1]. I'll find out, how we can move the Commons
Text issues from the Sandbox project to the new project.

Benedikt

[1] https://issues.apache.org/jira/browse/TEXT


>
>
>> - Rename similarity package to distance package? We have five Distance
>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
>> it a distance or a similarity? It should be renamed accordingly.
>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
>> Does it make sense to introduce a new class called Vector?
>>
>> Benedikt
>>
>>
>>>
>>> All the best,Bruno
>>>
>>>
>>>       From: Benedikt Ritter <br...@apache.org>
>>>  To: Commons Developers List <de...@commons.apache.org>
>>>  Sent: Sunday, April 19, 2015 9:01 PM
>>>  Subject: [TEXT] Review of the code base
>>>
>>> Hi all,
>>>
>>> I've looked through the code base of [text] and I already did some clean
>>> up. However there are a few things, that need more work IMHO:
>>>
>>> 1. IP clearance! There are several comments talking about code adapted
>>> from
>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>> the NOTICE.txt in [lang].
>>>
>>> 2. The names package needs work. Currently the HumanNameParser takes a
>>> Name
>>> object (which is just a wrapper around a String) as parameter. The parse
>>> result is then saved in fields (e.g. firstName). This makes the parser
>>> unusable after it has parsed a name. I think the API should be reworked
>>> such as:
>>> - The constructor of the parser takes configuration options which can be
>>> reused for several names to parse
>>> - the parse method takes a string as parameter, containing a name
>>> - the parse method returns an immutable Name objects which has getters
>>> for
>>> firstName, lastName etc.
>>>
>>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>>> /**
>>>  * Middle name.
>>>  */
>>> private String middle;
>>>
>>> Better rename the variable to middleName and drop the comment.
>>>
>>> 4. Several classes have no tests. The overall test coverage looks good,
>>> so
>>> this may not be a problem.
>>>
>>> 5. I don't think the shade configuration is correct. To me it looks like
>>> commons-text depends on [lang] and also shades some classes. Do we really
>>> need the explicit dependency? Furthermore, why do we need to shade so
>>> many
>>> classes? Why do we need anything beside StringUtils?
>>>
>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>>> this
>>> forward!
>>>
>>> Regards,
>>> Benedikt
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> http://people.apache.org/~britter/
>>> http://www.systemoutprintln.de/
>>> http://twitter.com/BenediktRitter
>>> http://github.com/britter
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
2015-04-19 17:29 GMT+02:00 Benedikt Ritter <br...@apache.org>:

>
>
> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:
>
>> Hello Bruno,
>>
>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br
>> >:
>>
>>> Thanks for the thorough code review Benedikt! Comments inline.
>>>
>>> > 1. IP clearance! There are several comments talking about code adapted
>>> from
>>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>> >the NOTICE.txt in [lang].
>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>>> And first try to fix it:
>>>
>>>
>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>>> Does that look correct? I also checked with the author before porting it
>>> to the initial Java version, and told him about the [text] component -
>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>>
>>
>> I've adjusted the changes a bit and merged it into master.
>>
>>
>>>
>>> > 2. The names package needs work.
>>> Completely agree. We can try to make it thread safe too. This will take
>>> a little longer.
>>> File new issue for tracking it
>>> https://issues.apache.org/jira/browse/SANDBOX-498
>>
>>
>> I've pushed a new branch containing a proposal how to fix this. Please
>> review.
>>
>>
>>>
>>>
>>> > 3. There are a some useless JavaDocs, which can be dropped. For
>>> example:
>>> > (...)> Better rename the variable to middleName and drop the comment.
>>> What about renaming the variables but leaving the comments?
>>>
>>
>> For I've corrected this for the names package in the SANDBOX-498 branch.
>>
>>
>>>
>>> > 4. Several classes have no tests. The overall test coverage looks
>>> good, so
>>> >this may not be a problem.
>>> Before the initial release I can start a development cycle just to add
>>> new tests for parts that seem important - like code with somewhat high [+5]
>>> cyclomatic complexity and uncovered branches. Just to make sure that the
>>> initial release will have an acceptable quality and less hidden bugs :-)
>>>
>>
>> Nice!
>>
>>
>>> > 5. I don't think the shade configuration is correct. To me it looks
>>> like
>>> >commons-text depends on [lang] and also shades some classes. Do we
>>> really
>>> >need the explicit dependency? Furthermore, why do we need to shade so
>>> many
>>> >classes? Why do we need anything beside StringUtils?
>>> My mistake. I copied the [functor] configuration without much thinking.
>>> IIRC, this dependency was added because of the names package. But it
>>> shouldn't be too complicate to replace it. It is mainly used for checking
>>> empty/blank values.
>>>
>>> So let's drop the [lang] dependency and use something like `$VAR != null
>>> && ! "".equals($VAR)` instead?
>>>
>>
>> I think we also need StringUtils in the similarity package. We should
>> handle this after we have merged the changes made to the names package.
>>
>>
>>> Thanks again for reviewing the code!
>>>
>>
>> No problem, thanks for pushing this forward.
>>
>> Here are some more things we need to take care of before 1.0:
>> - JIRA project
>>
>
> Requested a JIRA project:
> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477
>
>
>> - Rename similarity package to distance package? We have five Distance
>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
>> it a distance or a similarity? It should be renamed accordingly.
>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
>> Does it make sense to introduce a new class called Vector?
>>
>
We also need a logo! [1] Anybody creative enough to do that? :-)

Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-499


>
>> Benedikt
>>
>>
>>>
>>> All the best,Bruno
>>>
>>>
>>>       From: Benedikt Ritter <br...@apache.org>
>>>  To: Commons Developers List <de...@commons.apache.org>
>>>  Sent: Sunday, April 19, 2015 9:01 PM
>>>  Subject: [TEXT] Review of the code base
>>>
>>> Hi all,
>>>
>>> I've looked through the code base of [text] and I already did some clean
>>> up. However there are a few things, that need more work IMHO:
>>>
>>> 1. IP clearance! There are several comments talking about code adapted
>>> from
>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>>> the NOTICE.txt in [lang].
>>>
>>> 2. The names package needs work. Currently the HumanNameParser takes a
>>> Name
>>> object (which is just a wrapper around a String) as parameter. The parse
>>> result is then saved in fields (e.g. firstName). This makes the parser
>>> unusable after it has parsed a name. I think the API should be reworked
>>> such as:
>>> - The constructor of the parser takes configuration options which can be
>>> reused for several names to parse
>>> - the parse method takes a string as parameter, containing a name
>>> - the parse method returns an immutable Name objects which has getters
>>> for
>>> firstName, lastName etc.
>>>
>>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>>> /**
>>>  * Middle name.
>>>  */
>>> private String middle;
>>>
>>> Better rename the variable to middleName and drop the comment.
>>>
>>> 4. Several classes have no tests. The overall test coverage looks good,
>>> so
>>> this may not be a problem.
>>>
>>> 5. I don't think the shade configuration is correct. To me it looks like
>>> commons-text depends on [lang] and also shades some classes. Do we really
>>> need the explicit dependency? Furthermore, why do we need to shade so
>>> many
>>> classes? Why do we need anything beside StringUtils?
>>>
>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>>> this
>>> forward!
>>>
>>> Regards,
>>> Benedikt
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> http://people.apache.org/~britter/
>>> http://www.systemoutprintln.de/
>>> http://twitter.com/BenediktRitter
>>> http://github.com/britter
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
2015-04-19 17:24 GMT+02:00 Benedikt Ritter <br...@apache.org>:

> Hello Bruno,
>
> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <br...@yahoo.com.br>
> :
>
>> Thanks for the thorough code review Benedikt! Comments inline.
>>
>> > 1. IP clearance! There are several comments talking about code adapted
>> from
>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>> >the NOTICE.txt in [lang].
>> Filed https://issues.apache.org/jira/browse/SANDBOX-497
>> And first try to fix it:
>>
>>
>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
>> Does that look correct? I also checked with the author before porting it
>> to the initial Java version, and told him about the [text] component -
>> https://github.com/jasonpriem/HumanNameParser.php/issues/11
>
>
> I've adjusted the changes a bit and merged it into master.
>
>
>>
>> > 2. The names package needs work.
>> Completely agree. We can try to make it thread safe too. This will take a
>> little longer.
>> File new issue for tracking it
>> https://issues.apache.org/jira/browse/SANDBOX-498
>
>
> I've pushed a new branch containing a proposal how to fix this. Please
> review.
>
>
>>
>>
>> > 3. There are a some useless JavaDocs, which can be dropped. For example:
>> > (...)> Better rename the variable to middleName and drop the comment.
>> What about renaming the variables but leaving the comments?
>>
>
> For I've corrected this for the names package in the SANDBOX-498 branch.
>
>
>>
>> > 4. Several classes have no tests. The overall test coverage looks good,
>> so
>> >this may not be a problem.
>> Before the initial release I can start a development cycle just to add
>> new tests for parts that seem important - like code with somewhat high [+5]
>> cyclomatic complexity and uncovered branches. Just to make sure that the
>> initial release will have an acceptable quality and less hidden bugs :-)
>>
>
> Nice!
>
>
>> > 5. I don't think the shade configuration is correct. To me it looks like
>> >commons-text depends on [lang] and also shades some classes. Do we really
>> >need the explicit dependency? Furthermore, why do we need to shade so
>> many
>> >classes? Why do we need anything beside StringUtils?
>> My mistake. I copied the [functor] configuration without much thinking.
>> IIRC, this dependency was added because of the names package. But it
>> shouldn't be too complicate to replace it. It is mainly used for checking
>> empty/blank values.
>>
>> So let's drop the [lang] dependency and use something like `$VAR != null
>> && ! "".equals($VAR)` instead?
>>
>
> I think we also need StringUtils in the similarity package. We should
> handle this after we have merged the changes made to the names package.
>
>
>> Thanks again for reviewing the code!
>>
>
> No problem, thanks for pushing this forward.
>
> Here are some more things we need to take care of before 1.0:
> - JIRA project
>

Requested a JIRA project:
https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477


> - Rename similarity package to distance package? We have five Distance
> implementations but only one Similarity. Furthermore we have FuzzyScore. Is
> it a distance or a similarity? It should be renamed accordingly.
> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>.
> Does it make sense to introduce a new class called Vector?
>
> Benedikt
>
>
>>
>> All the best,Bruno
>>
>>
>>       From: Benedikt Ritter <br...@apache.org>
>>  To: Commons Developers List <de...@commons.apache.org>
>>  Sent: Sunday, April 19, 2015 9:01 PM
>>  Subject: [TEXT] Review of the code base
>>
>> Hi all,
>>
>> I've looked through the code base of [text] and I already did some clean
>> up. However there are a few things, that need more work IMHO:
>>
>> 1. IP clearance! There are several comments talking about code adapted
>> from
>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>> the NOTICE.txt in [lang].
>>
>> 2. The names package needs work. Currently the HumanNameParser takes a
>> Name
>> object (which is just a wrapper around a String) as parameter. The parse
>> result is then saved in fields (e.g. firstName). This makes the parser
>> unusable after it has parsed a name. I think the API should be reworked
>> such as:
>> - The constructor of the parser takes configuration options which can be
>> reused for several names to parse
>> - the parse method takes a string as parameter, containing a name
>> - the parse method returns an immutable Name objects which has getters for
>> firstName, lastName etc.
>>
>> 3. There are a some useless JavaDocs, which can be dropped. For example:
>> /**
>>  * Middle name.
>>  */
>> private String middle;
>>
>> Better rename the variable to middleName and drop the comment.
>>
>> 4. Several classes have no tests. The overall test coverage looks good, so
>> this may not be a problem.
>>
>> 5. I don't think the shade configuration is correct. To me it looks like
>> commons-text depends on [lang] and also shades some classes. Do we really
>> need the explicit dependency? Furthermore, why do we need to shade so many
>> classes? Why do we need anything beside StringUtils?
>>
>> Nevertheless, this library looks very good. Kudos to Bruno for pushing
>> this
>> forward!
>>
>> Regards,
>> Benedikt
>>
>>
>>
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>>
>>
>>
>>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by Benedikt Ritter <br...@apache.org>.
Hello Bruno,

2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita <br...@yahoo.com.br>:

> Thanks for the thorough code review Benedikt! Comments inline.
>
> > 1. IP clearance! There are several comments talking about code adapted
> from
> >PHP libraries. This needs to be listed in NOTICE.txt. For an example see
> >the NOTICE.txt in [lang].
> Filed https://issues.apache.org/jira/browse/SANDBOX-497
> And first try to fix it:
>
>
> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
> Does that look correct? I also checked with the author before porting it
> to the initial Java version, and told him about the [text] component -
> https://github.com/jasonpriem/HumanNameParser.php/issues/11


I've adjusted the changes a bit and merged it into master.


>
> > 2. The names package needs work.
> Completely agree. We can try to make it thread safe too. This will take a
> little longer.
> File new issue for tracking it
> https://issues.apache.org/jira/browse/SANDBOX-498


I've pushed a new branch containing a proposal how to fix this. Please
review.


>
>
> > 3. There are a some useless JavaDocs, which can be dropped. For example:
> > (...)> Better rename the variable to middleName and drop the comment.
> What about renaming the variables but leaving the comments?
>

For I've corrected this for the names package in the SANDBOX-498 branch.


>
> > 4. Several classes have no tests. The overall test coverage looks good,
> so
> >this may not be a problem.
> Before the initial release I can start a development cycle just to add new
> tests for parts that seem important - like code with somewhat high [+5]
> cyclomatic complexity and uncovered branches. Just to make sure that the
> initial release will have an acceptable quality and less hidden bugs :-)
>

Nice!


> > 5. I don't think the shade configuration is correct. To me it looks like
> >commons-text depends on [lang] and also shades some classes. Do we really
> >need the explicit dependency? Furthermore, why do we need to shade so many
> >classes? Why do we need anything beside StringUtils?
> My mistake. I copied the [functor] configuration without much thinking.
> IIRC, this dependency was added because of the names package. But it
> shouldn't be too complicate to replace it. It is mainly used for checking
> empty/blank values.
>
> So let's drop the [lang] dependency and use something like `$VAR != null
> && ! "".equals($VAR)` instead?
>

I think we also need StringUtils in the similarity package. We should
handle this after we have merged the changes made to the names package.


> Thanks again for reviewing the code!
>

No problem, thanks for pushing this forward.

Here are some more things we need to take care of before 1.0:
- JIRA project
- Rename similarity package to distance package? We have five Distance
implementations but only one Similarity. Furthermore we have FuzzyScore. Is
it a distance or a similarity? It should be renamed accordingly.
- ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>. Does
it make sense to introduce a new class called Vector?

Benedikt


>
> All the best,Bruno
>
>
>       From: Benedikt Ritter <br...@apache.org>
>  To: Commons Developers List <de...@commons.apache.org>
>  Sent: Sunday, April 19, 2015 9:01 PM
>  Subject: [TEXT] Review of the code base
>
> Hi all,
>
> I've looked through the code base of [text] and I already did some clean
> up. However there are a few things, that need more work IMHO:
>
> 1. IP clearance! There are several comments talking about code adapted from
> PHP libraries. This needs to be listed in NOTICE.txt. For an example see
> the NOTICE.txt in [lang].
>
> 2. The names package needs work. Currently the HumanNameParser takes a Name
> object (which is just a wrapper around a String) as parameter. The parse
> result is then saved in fields (e.g. firstName). This makes the parser
> unusable after it has parsed a name. I think the API should be reworked
> such as:
> - The constructor of the parser takes configuration options which can be
> reused for several names to parse
> - the parse method takes a string as parameter, containing a name
> - the parse method returns an immutable Name objects which has getters for
> firstName, lastName etc.
>
> 3. There are a some useless JavaDocs, which can be dropped. For example:
> /**
>  * Middle name.
>  */
> private String middle;
>
> Better rename the variable to middleName and drop the comment.
>
> 4. Several classes have no tests. The overall test coverage looks good, so
> this may not be a problem.
>
> 5. I don't think the shade configuration is correct. To me it looks like
> commons-text depends on [lang] and also shades some classes. Do we really
> need the explicit dependency? Furthermore, why do we need to shade so many
> classes? Why do we need anything beside StringUtils?
>
> Nevertheless, this library looks very good. Kudos to Bruno for pushing this
> forward!
>
> Regards,
> Benedikt
>
>
>
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>
>
>
>
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [TEXT] Review of the code base

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.
Thanks for the thorough code review Benedikt! Comments inline.

> 1. IP clearance! There are several comments talking about code adapted from
>PHP libraries. This needs to be listed in NOTICE.txt. For an example see
>the NOTICE.txt in [lang].
Filed https://issues.apache.org/jira/browse/SANDBOX-497
And first try to fix it: 

https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497
Does that look correct? I also checked with the author before porting it to the initial Java version, and told him about the [text] component - https://github.com/jasonpriem/HumanNameParser.php/issues/11
> 2. The names package needs work.
Completely agree. We can try to make it thread safe too. This will take a little longer.
File new issue for tracking it https://issues.apache.org/jira/browse/SANDBOX-498

> 3. There are a some useless JavaDocs, which can be dropped. For example:
> (...)> Better rename the variable to middleName and drop the comment.
What about renaming the variables but leaving the comments? 

> 4. Several classes have no tests. The overall test coverage looks good, so
>this may not be a problem.
Before the initial release I can start a development cycle just to add new tests for parts that seem important - like code with somewhat high [+5] cyclomatic complexity and uncovered branches. Just to make sure that the initial release will have an acceptable quality and less hidden bugs :-)
> 5. I don't think the shade configuration is correct. To me it looks like
>commons-text depends on [lang] and also shades some classes. Do we really
>need the explicit dependency? Furthermore, why do we need to shade so many
>classes? Why do we need anything beside StringUtils?
My mistake. I copied the [functor] configuration without much thinking. IIRC, this dependency was added because of the names package. But it shouldn't be too complicate to replace it. It is mainly used for checking empty/blank values. 

So let's drop the [lang] dependency and use something like `$VAR != null && ! "".equals($VAR)` instead?
Thanks again for reviewing the code!

All the best,Bruno

 
      From: Benedikt Ritter <br...@apache.org>
 To: Commons Developers List <de...@commons.apache.org> 
 Sent: Sunday, April 19, 2015 9:01 PM
 Subject: [TEXT] Review of the code base
   
Hi all,

I've looked through the code base of [text] and I already did some clean
up. However there are a few things, that need more work IMHO:

1. IP clearance! There are several comments talking about code adapted from
PHP libraries. This needs to be listed in NOTICE.txt. For an example see
the NOTICE.txt in [lang].

2. The names package needs work. Currently the HumanNameParser takes a Name
object (which is just a wrapper around a String) as parameter. The parse
result is then saved in fields (e.g. firstName). This makes the parser
unusable after it has parsed a name. I think the API should be reworked
such as:
- The constructor of the parser takes configuration options which can be
reused for several names to parse
- the parse method takes a string as parameter, containing a name
- the parse method returns an immutable Name objects which has getters for
firstName, lastName etc.

3. There are a some useless JavaDocs, which can be dropped. For example:
/**
 * Middle name.
 */
private String middle;

Better rename the variable to middleName and drop the comment.

4. Several classes have no tests. The overall test coverage looks good, so
this may not be a problem.

5. I don't think the shade configuration is correct. To me it looks like
commons-text depends on [lang] and also shades some classes. Do we really
need the explicit dependency? Furthermore, why do we need to shade so many
classes? Why do we need anything beside StringUtils?

Nevertheless, this library looks very good. Kudos to Bruno for pushing this
forward!

Regards,
Benedikt






-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter