You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@opennlp.apache.org by "mark meiklejohn (JIRA)" <ji...@apache.org> on 2011/07/26 20:49:09 UTC

[jira] [Created] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Case Sensitivie Flag & Custom Tag Dictionary
--------------------------------------------

                 Key: OPENNLP-239
                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
             Project: OpenNLP
          Issue Type: New Feature
          Components: Parser
    Affects Versions: tools-1.5.1-incubating
            Reporter: mark meiklejohn
             Fix For: tools-1.5.2-incubating


Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079119#comment-13079119 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I'm fine either way.  If we remove the case; do we have to review the methods that use the dictionary to verify the proper usages?

James

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079120#comment-13079120 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I'll do the changes, unless you already have.  I'm not familiar on how to roll back the changes; but, they are very easy to do.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088636#comment-13088636 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

James, are we done here? Can the issue be closed?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Issue Comment Edited] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081362#comment-13081362 ] 

James Kosin edited comment on OPENNLP-239 at 8/9/11 3:51 AM:
-------------------------------------------------------------

Jorn,

Sorry, I've been busy...  The default was because I didn't know the 
default state for many of the models.  Most of the time it is based on 
how they are created.  I can fix that easily; so that it gets set to 
true if not present.

I found a way to implement without being static.

I'm also not familiar with the UMIA branch of OpenNLP or what the StringDictionary's requirements are... it appears to not be using any case sensitive flags; so... I wrote a default of true (to keep the flag the default).

James



      was (Author: jkosin):
    Jorn,

Sorry, I've been busy...  The default was because I didn't know the 
default state for many of the models.  Most of the time it is based on 
how they are created.  I can fix that easily; so that it gets set to 
true if not present.

The static was required because the function is static and doesn't have 
access to the non-static members.  I agree it was a nasty compromise.  
The other way to go would be to add the serializing to the Dictionary 
object itself... but, I don't know the problems with growing the 
Dictionary class too large... lastly, we could have the serializing as 
non-static meaning we would need to create a DictionarySerializer to use.

James

I'm also not familiar with the


  
> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071722#comment-13071722 ] 

William Colen commented on OPENNLP-239:
---------------------------------------

"Should case sensitivity or insensitivity be dominant?"

It depends on the tool. We should be able to try both while creating a new model and decide it in terms of F1.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078568#comment-13078568 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Could someone review the changes please.  I've made the entry a lowercase "case" attribute instead of upper case as in the example above.
Thanks,

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080379#comment-13080379 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

POSDictionary is doing based on the flag a String.toLowerCase(). The current implementation works correctly. To speed up things we would need to create another wrapper and use a hashCode method which also ignores the case.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Closed] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Joern Kottmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joern Kottmann closed OPENNLP-239.
----------------------------------


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079182#comment-13079182 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I've made changes; however, I have to review 3 other classes that use the DictionarySerializer to serialize other Dictionary Types.

NGramModel.java  -- looks like this is really just serializing the data.  No refrences to caseSensitivity.
POSDictionary.java -- uses caseSensitivity; however, comments say it doesn't work and is always set to true.
DetokenizationDictionary.java  -- doesn't seem to use and is just serializing.

Please REVIEW these classes as well and make any needed changes.  I'm going to keep digging for answers.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089102#comment-13089102 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Jorn,

Will close this one.  We do need to open a new discussion on the purpose of the POSDictionary and its usage of the caseSensitivity flag.  And if needed JIRA issues created to re-work that code.

Also, I'm not sure if the changes for saving the case sensitivity flag actaully fix the problem/issue here or that I've only addressed one aspect of the issue.

James

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kosin reassigned OPENNLP-239:
-----------------------------------

    Assignee: James Kosin

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078560#comment-13078560 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Okay,
(1) I'm going to implement an attribute for the case sensitivity setting for the entries being saved to the dictionary.  I'm going to keep them separate for now and not try to combine them into the dictionary object.  The reason being; I'd like to keep the created Dictionary object default the user's required default setting for the comparison and not for the original dictionary elements as they where added.  You'll understand as things progress.

This means the new dictionary output will be like this
<code>
<dictionary>
<entry CASE="true">
<token>Patrick</token>
</entry>
</code>

(2)  Case sensitivity will be dependant on how the dictionary is created; however, if the user selects case insensitive comparison from the command line, the false value will take precidence to allow for a case insensitive compare on equals().  This means a dictionary created with the false default will not compare case sensitve regardless of any true value specified by the user.  So, create the dictionary accordingly to the true case sensitivity that is required.

(3)  I'm hoping the attributes in the XML will allow us to be backward compatible; since they will not exist otherwise.  I'm also hoping the case attribute will be backward compatible as well in that the older inplementation never used an attribute value system.



> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078733#comment-13078733 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Maybe... the issue is also with comparisons.  The Dictionary class takes a flag for the case sensitivity and the equals() method is adjusted based on this flag for all the classes and sub-classes that are returned from the Dictionary to allow the flag to persist.  It is one reason why it would be difficult to remove the duplicate isCaseSensitive flag from the StringListWrapper class.

If the dictionary were created with the flag to true, then all entries would use a case sensitive comparison.  The down side is that some of the tools allow us to modify the case sensitivity after dictionary creation in the sense of chaning the meaning (when I say creation I mean when it was created from the original data and not when the user is using the dictionary).

The original code however would always use the dictionary flag when created to determine the case comparison rules (when the user created the dictionary from the XML stream).  This looses the original intent I believe of the flag and the dictionary may not work as intended.

The changes I've done so far allows us to change the dictionary flag (to the users preference), if allowed, and allow the dictionary to be used to compare two different strings with the appropriate case sensitive or insensitive test.
Unmutable also means the DictionarySerializer would need to return the Dictionary class created from the input stream.... which I was trying to avoid if possible.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078750#comment-13078750 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

Just looked a little at the code, the StringListWrapper.equals is then also a bit simpler, because it only needs to use one flag to determine if the compare is case sensitive or not.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071735#comment-13071735 ] 

William Colen commented on OPENNLP-239:
---------------------------------------

James, are you following the conversation "Case sensitivity of a dictionary loaded from the model" in Dev list?

In my opinion we should stop using this StringListWrapper. This class is a good idea but is making things complicated. It wraps the StringList to add the case sensitivity flag and overrides equals and hashcode methods according to the flag.
Instead of it, in my opinion, in case of a case insensitive dictionary, we should lower case the entries during the dictionary creation and persist the flag to the serialized file.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Joern Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089108#comment-13089108 ] 

Joern Kottmann commented on OPENNLP-239:
----------------------------------------

I fixed the POSDictionary, or did I miss something? Can you please review the changes I made there?

You pointed out an issue in the Dictionary method which simply returns a string set, I believe that still needs fixing, right?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080337#comment-13080337 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Jorn,
The problem is the POSDictionary isn't like the normal Dictionary so; it doesn't use the StringWrapper.  The POSDictionary also uses a HashTable... So the case flag is just there; however, even if it is set doesn't do anything.  What we need to do is decide if we want to use the normal Dictionary class or create another wrapper class around the strings to get the correct compares.  That aside, I'm not sure how the hash() function comes into play; but, we may have problems with the Dictionary class already, because the hash() is always with lower-case letters regardless of the case sensitivity setting.  This maybe wrong if Java uses the hash() to determine if two items are identical.
I have to dig a bit more into this to determine if we need any more changes.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080005#comment-13080005 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

The POSDictionary.create method needs to set the case sensitive flag. The code looks like ti will work when this is fixed. I already changed that in my local copy, should I check in this change James?

The comments said it does not work, is there any other reason for that?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078741#comment-13078741 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

I understand the comparison issue and that is why we need to write the case flag at the begin of the dictionary. The Dictionary class never supported switching of the case after it was created. So just having one final flag per dictionary is fine.

StringListWarpper will be made non-static, and the case sensitive flag static. We need to change the serialization API a little to support to read and store the case flag. And then our issues are solved.

Since Dictionary should be immutable we should think about making the put method deprecated. 

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080963#comment-13080963 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

There is also a compile error in the uima integration code, there is a class which tries to create a dictionary, but the method now requires also a boolean flag. So obviously that was not a backward compatible change.

Can we do that in the dictionary serializer code? I am not sure, lets speak about it on the ML.
Would be nice to get a few opinons about this and then decide based on that.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071442#comment-13071442 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Review...

The dictionary has the case sensitivity flag and implementes multiple copies throughout.... hmmm... this will make it hard to directly relate any new setting for the caseSensitivity.
I'm going to compare with the older code to see what changed.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078506#comment-13078506 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Alright, 
(1) Even thought the StringListWrapper makes things complicated, there isn't an easier way to return a StringList with case sensitivity or insensitivity while using the equals() method.  The other method would involve going through all the uses for the Dictionary and the StringList and determine how to allow for case sensitive and insensitive testing.  Not really nice.
(2) Saving the case sensitivity of the Dictionary is becoming a requirement; however, we need clairity on two options... (a) default setting and the priority... we can make case insensitive the default and the priority... ie: setting another to true and trying to compare will result still in a case insensitive or the otherway around.  (b) saving the case sensitivity is a bit problematic in that the flag is required for the dictionary on creation and is used as entries are added... this means, it is difficult to restore the flag after the fact, or even before without creating a new dictionary class to store the entries.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080338#comment-13080338 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Okay, brushed up on hashing.... and don't see any problems with the current implementation.
I'll need to get the POSDictionary a wrapper though for the String compares to be case sensitive/insensitive as set by the flag.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080971#comment-13080971 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

One more comment, the case sensitive flag in DictionarySerializer must not be static, because this causes a thread safety issue.
We need to find another way to implement this.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078621#comment-13078621 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

We should not add the flag per entry, it should be only once at the beginning of the dictionary. I think it should be removed from StringListWrapper as well, and only be in the Dictionary class.

If we have duplicate entries it does not matter because the dictionary is a set, which contains all strings mentioned in the dictionary file, if there is one twice it will still only be one in the set.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071481#comment-13071481 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Should case sensitivity or insensitivity be dominant?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kosin resolved OPENNLP-239.
---------------------------------

    Resolution: Fixed

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084512#comment-13084512 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I've gone over everything for the Dictionary class now.  The tests are all working.  
The tests for case sensitive and insensitive dictionaries hints at a few areas we may want to work on.  But, I'll defer them for other JIRA issues after discussing them on the development list.

The POS Dictionary needs a lot of work.  I'll open a new issue for that with details.  Mostly because the POSDictionary class tries to implement its own case sensitivity flag for other reasons... or maybe the same reasons.  I'll post that as a bigger re-write for a later release.

Can we close this issue?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079187#comment-13079187 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I also fixed a bug.  No effect to the code; because the flag wasn't being used outside of setting true/false.
There was an
<code>
if () {
} else if () {
} else if() {
}
</code>
that was testing the same token string in the first and last if () cases.  Of course, the last one would NEVER get triggered; so, the flag would never get set false... once set.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072130#comment-13072130 ] 

William Colen commented on OPENNLP-239:
---------------------------------------

The asStringSet returns an special implementation of Set. It relies on the dictionary field entrySet that is a Set<StringListWrapper>, so the hashcode and equals are override and the implementation use the flag. Also the returned Set<String> overrides the contains method:

      public boolean contains(Object obj) {
        boolean result = false;
        if (obj instanceof String) {
          String str = (String) obj;
          result = entrySet.contains(new StringListWrapper(new StringList(str),
              caseSensitive));
        }
        return result;
      }

it wraps the string with the StringListWrapper adding the flag so the entrySet.contains(..) should work.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071424#comment-13071424 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

I'll look into the case sensitivity flag and determine the best course of action.
Currently, it looks like we will be adding a configuration item for this to the saved dictionary structure with a method to allow modification.

Currently, I believe the case sensitivity is only used to determine if two entries that are different case will be added to the dictionary or not.
I believe what we may be wanting is a flag that can do multiple things; (a) determine if an entry with different case is allowed in the dictionary, (b) be used in the use of the dictionary for case sensitivity, (c) be able to override the default value for the dictionary.

Let me know if I'm wrong on any counts, as I review the code.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072131#comment-13072131 ] 

William Colen commented on OPENNLP-239:
---------------------------------------

BTW there are unit tests: 
opennlp.tools.dictionary.DictionaryAsSetCaseInsensitiveTest
opennlp.tools.dictionary.DictionaryAsSetCaseSensitiveTest

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089118#comment-13089118 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Yes, that along with why POSDictionary is needed.  I didn't see much that was different from the current Dictionary.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078718#comment-13078718 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Okay,

Here are some of the issues:
(a)  I didn't add a check for duplicates, it was only a statement of fact that there is only one.  But, it does matter when building the dictionary that if case sensitive or not that a dictionary that is not case sensitive may want to keep the duplicates out, especially if when building the dictionary they are just adding them as they are originally done which may cause issues later.

(b)  I tried reviewing a method to remove the isCaseSensitive entry from the StringListWrapper... unfortunately, we have a few small problems: (1) the StringListWrapper is a static class (most likely to increase performance), this means the StringListWrapper doesn't have access to the Dictionary caseSensitive flag to follow the entries.  (2) even if I change that, then a user will be allowed to change a Dictionary that was created to be case insensitive to a case sensitive type which will most likely produce undesireable results [what may be happening now].

The current implementation, allows a bit more flexibility:
Dictionary    Entry      Other Entry     Result
--------------     --------     ----------------      -----------
true             true        true                a case sensitive comparison is done with equals()
false            true        false              a case insensitive comparison is done with equals()
xxxx            false       xxxx             a case insensitive comparison is done with equals()

This can be changed later if needed; but in English.  If either one is case insensitive when doing a compare then the compare is done with case insensitive comparison.  ie:  'Hello' and 'hello' are always the same equals(),

I kept them separate for now in that we could create a Dictionary class later that allows a mixed comparison with other strings... ie: If we had a string we know always to be capitalized and case sensitive like 'ASCII' we could put the entry in a Dictionary that also contains entries that have case insensitivity, with other words containign false for the case attribute.

The bad part about putting the flag only in the Dictionary flag is that then we only have one flag, and we loose the original Dictionary meaning and then having the flag also comes into question for building the Dictionary.

Sorry, I don't really mean to sound ugly... I'm just trying to get some discussion on this and what we could do about the problem.  I'm okay with moving the attribute to the dictionary attributes section instead of the entry attributes section.  Just trying to keep options open; since, it doesn't look like moving the flag is easily done without causing more issues.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071450#comment-13071450 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Okay,

I found one problem, outside of the multiple copies.
Any new dictionary item setting the case sensitivity will be voided by the orignal dictionary setting.

<code>
if (isCaseSensitive) {
         result = this.stringlist.equals(other.getStringList());
} ...
</code>

Ignores the case setting for other and will result in different results depending on the order of the compare.

It probably should read:
<code>
if (isCaseSensitive || other.isCaseSensitive) {
...
</code>


James

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080956#comment-13080956 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

James, it looks like you flipped the default, right? Before we had this flag, the dictionary was always case sensitive after it was deserialized from a model, now it is case insensitive after it is deserialized from a model. It is not a big deal, but it can be very confusing in certain cases. I suggest that the default is case sensitive if the attribute in the xml does not exist.

What do you think?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078753#comment-13078753 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

The new put method should be removed again, since it makes it unclear how to handle the case flag. This add the semantics of a case flag per entry to the Dictionary, which we did not have previously.

What do you think James? Should we go back to the revision at which you started, and then simply the code to only rely on one final case flag?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079309#comment-13079309 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

Inside DictionarySerializer you define a string constant ATTRIBUTE_CASE. It would be more expressive if it is named ATTRIBUTE_CASE_SENSITIVE, and the same goes for the value. "Is it case sensitive?" makes more sense than "Is it case?".

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079193#comment-13079193 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Okay, the code is saving the case sensitivity now.  To support the older dictionaries, the code will treat any older dictionary as a 'false' for the case sensitivity.
The original poster seems to have a problem with the POSDictionary I believe.  So, I'm off to fix that.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072123#comment-13072123 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Hmm... very interesting.

I've actually got the Dictionary{} class replaced by a completely empty class and the caseSensitivity flag isn't really being used anywhere.

ie: TokenizerME uses dictionary to get a asStringSet() to return a Set<String> for usage... loosing case flag.

List goes on...
I'll get a more complete list soon.


> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078741#comment-13078741 ] 

Jörn Kottmann edited comment on OPENNLP-239 at 8/3/11 1:33 PM:
---------------------------------------------------------------

I understand the comparison issue and that is why we need to write the case flag at the begin of the dictionary. The Dictionary class never supported switching of the case after it was created. So just having one final flag per dictionary is fine.

StringListWarpper will be made non-static, and the case sensitive flag final. We need to change the serialization API a little to support to read and store the case flag. And then our issues are solved.

Since Dictionary should be immutable we should think about making the put method deprecated. 

      was (Author: joern):
    I understand the comparison issue and that is why we need to write the case flag at the begin of the dictionary. The Dictionary class never supported switching of the case after it was created. So just having one final flag per dictionary is fine.

StringListWarpper will be made non-static, and the case sensitive flag static. We need to change the serialization API a little to support to read and store the case flag. And then our issues are solved.

Since Dictionary should be immutable we should think about making the put method deprecated. 
  
> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078720#comment-13078720 ] 

Jörn Kottmann commented on OPENNLP-239:
---------------------------------------

I believe it was a mistake to make the Dictionary class not immutable. The case sensitive flag should be final, and it should not be possible to change it. This way I think most of the problems explained above go away, right?

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072139#comment-13072139 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Thanks William,

I see that now.

> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-239) Case Sensitivie Flag & Custom Tag Dictionary

Posted by "James Kosin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081362#comment-13081362 ] 

James Kosin commented on OPENNLP-239:
-------------------------------------

Jorn,

Sorry, I've been busy...  The default was because I didn't know the 
default state for many of the models.  Most of the time it is based on 
how they are created.  I can fix that easily; so that it gets set to 
true if not present.

The static was required because the function is static and doesn't have 
access to the non-static members.  I agree it was a nasty compromise.  
The other way to go would be to add the serializing to the Dictionary 
object itself... but, I don't know the problems with growing the 
Dictionary class too large... lastly, we could have the serializing as 
non-static meaning we would need to create a DictionarySerializer to use.

James

I'm also not familiar with the



> Case Sensitivie Flag & Custom Tag Dictionary
> --------------------------------------------
>
>                 Key: OPENNLP-239
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-239
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Parser
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: mark meiklejohn
>            Assignee: James Kosin
>             Fix For: tools-1.5.2-incubating
>
>
> Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira