
Posted to dev@uima.apache.org by "Thomas Hampp (JIRA)" <de...@uima.apache.org> on 2010/05/17 10:23:43 UTC

[jira] Created: (UIMA-1782) Encoding of text files during import should be configurable

Encoding of text files during import should be configurable
-----------------------------------------------------------

                 Key: UIMA-1782
                 URL: https://issues.apache.org/jira/browse/UIMA-1782
             Project: UIMA
          Issue Type: Improvement
          Components: CasEditor
    Affects Versions: 2.3
            Reporter: Thomas Hampp


During import of text files into a corpus, it seems to be impossible to control the encoding used. It looks like the default platform encoding is used (Latin-1 on Western Windows systems). The Eclipse default encoding settings for text files don't seem to affect the import encoding. That makes it impossible to import documents with international characters in UTF-8.
Ideally, the encoding should be selectable in a drop-down field in the import wizard.
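
To illustrate the problem described above, here is a minimal sketch of decoding the same bytes with an explicitly chosen charset versus a Latin-1 default. The class and method names (TextImportSketch, decode, readText) are hypothetical and are not taken from the CAS Editor source; the sketch assumes Java 7 or later for java.nio.file and StandardCharsets.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not CAS Editor code.
final class TextImportSketch {

    // Decodes raw file bytes with the charset the caller selected.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Reads a whole file and decodes it with the given charset
    // instead of silently using the platform default.
    static String readText(Path file, Charset charset) throws IOException {
        return decode(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) {
        // "J\u00f6rn" is "Jörn", written with an escape so the source file encoding does not matter.
        byte[] utf8Bytes = "J\u00f6rn".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));      // decoded as intended
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1)); // garbled: wrong charset
    }
}
```

The second call shows what happens when UTF-8 input is read with a Latin-1 default: the two bytes of the umlaut are decoded as two separate characters.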

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (UIMA-1782) Encoding of text files during import should be configurable

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann resolved UIMA-1782.
---------------------------------

    Resolution: Fixed

Thomas, please test the new encoding option and comment on or close the issue.



[jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868626#action_12868626 ] 

Jörn Kottmann commented on UIMA-1782:
-------------------------------------

If an invalid encoding is chosen, the dialog now shows a page-level error message that tells the user about the problem.
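
As a rough sketch of the kind of check behind such a message, the following distinguishes a syntactically illegal name from a legal name the JVM does not support. The class name, method name, and message texts are made up for illustration; only Charset.isSupported and IllegalCharsetNameException come from the JDK, and how the message reaches the wizard page is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical validator, not the actual CAS Editor implementation.
final class CharsetNameCheck {

    // Returns null when the name is usable, otherwise a message a wizard page could display.
    static String errorMessageFor(String typedName) {
        if (typedName == null || typedName.trim().isEmpty()) {
            return "Please enter an encoding.";
        }
        String name = typedName.trim();
        try {
            if (!Charset.isSupported(name)) {
                return "The encoding \"" + name + "\" is not supported by this JVM.";
            }
        } catch (IllegalCharsetNameException e) {
            // Thrown for names containing characters that are not allowed in charset names.
            return "\"" + name + "\" is not a legal encoding name.";
        }
        return null;
    }
}
```

A dialog could call errorMessageFor on every modification of the text field and enable its confirm button only when the result is null.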



Re: [jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by Thilo Götz <tw...@gmx.de>.

On 5/18/2010 17:55, Marshall Schor wrote:
> 
> 
> On 5/18/2010 11:28 AM, Thilo Götz wrote:
>> On 5/18/2010 17:03, Jörn Kottmann wrote:
>>   
>>> Thilo Götz wrote:
>>>     
>>>> FYI, here's how you can create a list of all available text
>>>> encodings in the JVM you're running in.  This can lead to a
>>>> very long combo box, though :-)
>>>>
>>>>     Map<String, Charset> charsetMap = Charset.availableCharsets();
>>>>
>>>>   
>>>>       
>>> Actually I tried this first, but got a very long list of encodings.
>>> Then I decided to do it like the Eclipse guys in the "Properties ->
>>> Resource" dialog: they only display the Java standard encodings, and
>>> it's possible to type in any supported encoding.
>>>
>>> Do you think we should just display all encodings (maybe plus aliases) ?
>>>
>>> Jörn
>>>     
>> That's a matter of taste.  It's true that it's a very long
>> list, and most of them are never used by anyone.  On the
>> other hand, if you *are* using one of the rarer ones, and
>> you have to type it in every time, that's also annoying.
>>   
> 
> What about making the list the common ones, and then making any that get
> typed in "sticky"?
> -Marshall

That's what I had in CVD for years.  It finally got so annoying
that I changed it to display the whole list of available charsets.
What I would probably change now is to put the common ones in
front.  ATM, you have to go to the end of the list for the UTF-*
encodings, which is a pain.  Still, better than the manual add
I had before, IMHO.

--Thilo
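
The "common ones in front" idea could look roughly like the sketch below. It is not CVD code: the class name and the choice of which charsets count as common are assumptions (here, the six charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not CVD or CAS Editor code.
final class CharsetOrdering {

    // Assumption: treat the six charsets guaranteed on every Java platform as the "common" ones.
    static final List<String> COMMON = Arrays.asList(
            "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE", "ISO-8859-1", "US-ASCII");

    // Returns all available charset names, with the common ones moved to the front.
    static List<String> commonFirst() {
        // LinkedHashSet keeps insertion order and drops the later duplicates.
        Set<String> ordered = new LinkedHashSet<String>(COMMON);
        ordered.addAll(Charset.availableCharsets().keySet());
        return new ArrayList<String>(ordered);
    }

    public static void main(String[] args) {
        for (String name : commonFirst()) {
            System.out.println(name);
        }
    }
}
```

The rest of the list keeps the order of Charset.availableCharsets(), so the UTF-* entries no longer sit at the very end.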

>> So I didn't mean to push you in that direction, I just
>> didn't know if you were aware of the alternative.  As you
>> have obviously thought about it, I would just leave it
>> and see what our users say.
>>
>> --Thilo
>>
>>
>>

Re: [jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by Marshall Schor <ms...@schor.com>.


On 5/18/2010 11:28 AM, Thilo Götz wrote:
> On 5/18/2010 17:03, Jörn Kottmann wrote:
>   
>> Thilo Götz wrote:
>>     
>>> FYI, here's how you can create a list of all available text
>>> encodings in the JVM you're running in.  This can lead to a
>>> very long combo box, though :-)
>>>
>>>     Map<String, Charset> charsetMap = Charset.availableCharsets();
>>>
>>>   
>>>       
>> Actually I tried this first, but got a very long list of encodings.
>> Then I decided to do it like the Eclipse guys in the "Properties ->
>> Resource" dialog: they only display the Java standard encodings, and
>> it's possible to type in any supported encoding.
>>
>> Do you think we should just display all encodings (maybe plus aliases) ?
>>
>> Jörn
>>     
> That's a matter of taste.  It's true that it's a very long
> list, and most of them are never used by anyone.  On the
> other hand, if you *are* using one of the rarer ones, and
> you have to type it in every time, that's also annoying.
>   

What about making the list the common ones, and then making any that get
typed in "sticky"?
-Marshall
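
A minimal in-memory sketch of such a "sticky" list follows. The class is hypothetical, and persisting the remembered names between sessions (for example in the dialog settings) is deliberately left out.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a list that remembers typed-in encodings.
final class StickyEncodingList {

    // Starts with the charsets every Java platform must support.
    private final Set<String> entries = new LinkedHashSet<String>(Arrays.asList(
            "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE", "ISO-8859-1", "US-ASCII"));

    // Remembers a typed-in name if the JVM supports it; returns whether it was usable.
    boolean remember(String typedName) {
        try {
            if (!Charset.isSupported(typedName)) {
                return false;
            }
        } catch (IllegalArgumentException e) {
            // Covers null and IllegalCharsetNameException (a subclass).
            return false;
        }
        // Store the canonical name so that aliases do not create duplicate entries.
        entries.add(Charset.forName(typedName).name());
        return true;
    }

    List<String> items() {
        return new ArrayList<String>(entries);
    }
}
```

Storing canonical names means that typing an alias of an entry that is already listed does not make the list longer.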
> So I didn't mean to push you in that direction, I just
> didn't know if you were aware of the alternative.  As you
> have obviously thought about it, I would just leave it
> and see what our users say.
>
> --Thilo
>
>
>

Re: [jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by Thilo Götz <tw...@gmx.de>.

On 5/18/2010 17:03, Jörn Kottmann wrote:
> Thilo Götz wrote:
>> FYI, here's how you can create a list of all available text
>> encodings in the JVM you're running in.  This can lead to a
>> very long combo box, though :-)
>>
>>     Map<String, Charset> charsetMap = Charset.availableCharsets();
>>
>>   
> Actually I tried this first, but got a very long list of encodings.
> Then I decided to do it like the Eclipse guys in the "Properties ->
> Resource" dialog: they only display the Java standard encodings, and
> it's possible to type in any supported encoding.
> 
> Do you think we should just display all encodings (maybe plus aliases) ?
> 
> Jörn

That's a matter of taste.  It's true that it's a very long
list, and most of them are never used by anyone.  On the
other hand, if you *are* using one of the rarer ones, and
you have to type it in every time, that's also annoying.

So I didn't mean to push you in that direction, I just
didn't know if you were aware of the alternative.  As you
have obviously thought about it, I would just leave it
and see what our users say.

--Thilo

Re: [jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by Jörn Kottmann <ko...@gmail.com>.

Thilo Götz wrote:
> FYI, here's how you can create a list of all available text
> encodings in the JVM you're running in.  This can lead to a
> very long combo box, though :-)
>
>     Map<String, Charset> charsetMap = Charset.availableCharsets();
>
>   
Actually, I tried this first, but got a very long list of encodings.
Then I decided to do it like the Eclipse guys in the "Properties ->
Resource" dialog: they only display the Java standard encodings, and
it's possible to type in any supported encoding.

Do you think we should just display all encodings (maybe plus aliases) ?

Jörn

Re: [jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by Thilo Götz <tw...@gmx.de>.

FYI, here's how you can create a list of all available text
encodings in the JVM you're running in.  This can lead to a
very long combo box, though :-)

    Map<String, Charset> charsetMap = Charset.availableCharsets();

--Thilo
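
Expanding the one-liner above into something runnable, the following sketch prints every available charset together with its aliases. The class and helper names are made up for illustration; Charset.availableCharsets(), name() and aliases() are standard JDK calls.

```java
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical demo class for listing the charsets of the running JVM.
final class ListCharsets {

    // Formats a charset as its canonical name followed by its set of aliases.
    static String describe(Charset charset) {
        return charset.name() + " " + charset.aliases();
    }

    public static void main(String[] args) {
        // The map is sorted by canonical name; its size depends on the JVM.
        Map<String, Charset> charsetMap = Charset.availableCharsets();
        for (Charset charset : charsetMap.values()) {
            System.out.println(describe(charset));
        }
        System.out.println(charsetMap.size() + " charsets available");
    }
}
```

The output gives a feel for how long a combo box filled from this map would be on a particular JVM.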

On 5/18/2010 01:40, Jörn Kottmann (JIRA) wrote:
> 
>     [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868448#action_12868448 ] 
> 
> Jörn Kottmann commented on UIMA-1782:
> -------------------------------------
> 
> There is now an option to specify the encoding of the imported text files. It is always preset to the default platform encoding. The combo box displays the Java standard charsets (see here: http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html).
> If the user wants to use a non-standard Java charset (these are usually available), they have to type in its name. While the name is typed in, it is validated: if the charset is available, the user can proceed with the import; otherwise the "Apply" button remains disabled.
> 
> It would be nice to add a warning telling the user that the "Apply" button is disabled because of an invalid charset name or an unsupported charset.
> 

[jira] Commented: (UIMA-1782) Encoding of text files during import should be configurable

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868448#action_12868448 ] 

Jörn Kottmann commented on UIMA-1782:
-------------------------------------

There is now an option to specify the encoding of the imported text files. It is always preset to the default platform encoding. The combo box displays the Java standard charsets (see here: http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html).
If the user wants to use a non-standard Java charset (these are usually available), they have to type in its name. While the name is typed in, it is validated: if the charset is available, the user can proceed with the import; otherwise the "Apply" button remains disabled.

It would be nice to add a warning telling the user that the "Apply" button is disabled because of an invalid charset name or an unsupported charset.
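
The combo box contents and preset described in this comment might be modeled along these lines. This is only a sketch under assumptions: the class and method names are invented, no SWT widget code is shown, and adding the platform default to the list when it is not one of the standard charsets is a design choice of the sketch, not something stated in the comment.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for the encoding combo box; no UI toolkit code is shown.
final class EncodingComboModel {

    // The standard charsets listed in the java.nio.charset.Charset documentation.
    static final String[] STANDARD = {
            "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"};

    // The preset: the canonical name of the platform default encoding.
    static String initialSelection() {
        return Charset.defaultCharset().name();
    }

    // Items to display; the preset is added in front if it is not a standard charset.
    static List<String> comboItems() {
        List<String> items = new ArrayList<String>(Arrays.asList(STANDARD));
        String preset = initialSelection();
        if (!items.contains(preset)) {
            items.add(0, preset);
        }
        return items;
    }
}
```

On a Western Windows system the preset would typically be a windows-125x charset, which is why the sketch makes sure the preset is always present among the items.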



[jira] Updated: (UIMA-1782) Encoding of text files during import should be configurable

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann updated UIMA-1782:
--------------------------------

         Assignee: Jörn Kottmann
    Fix Version/s: 2.3.1



[jira] Updated: (UIMA-1782) Encoding of text files during import should be configurable

Posted by "Marshall Schor (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall Schor updated UIMA-1782:
---------------------------------

    Fix Version/s: 2.3.1SDK
                       (was: 2.3.1)

