Posted to dev@uima.apache.org by "Jörn Kottmann (JIRA)" <de...@uima.apache.org> on 2011/07/14 12:27:00 UTC

[jira] [Created] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Cas Editor document import wizard should replace or remove non-xml characters
-----------------------------------------------------------------------------

                 Key: UIMA-2165
                 URL: https://issues.apache.org/jira/browse/UIMA-2165
             Project: UIMA
          Issue Type: Improvement
          Components: CasEditor
            Reporter: Jörn Kottmann
            Assignee: Jörn Kottmann
            Priority: Minor


When importing a text file which contains non-xml characters the Cas Editor should automatically replace or remove these.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Closed] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed UIMA-2165.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3.2SDK

> Cas Editor document import wizard should replace or remove non-xml characters
> -----------------------------------------------------------------------------
>
>                 Key: UIMA-2165
>                 URL: https://issues.apache.org/jira/browse/UIMA-2165
>             Project: UIMA
>          Issue Type: Improvement
>          Components: CasEditor
>            Reporter: Jörn Kottmann
>            Assignee: Jörn Kottmann
>            Priority: Minor
>             Fix For: 2.3.2SDK
>
>
> When importing a text file which contains non-xml characters the Cas Editor should automatically replace or remove these.


       

Re: [jira] [Commented] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Posted by Richard Eckart de Castilho <ec...@tk.informatik.tu-darmstadt.de>.
>> any warning message given if things are removed (I hope)?

> No, maybe we should make it optional. Then the following would happen:
> 
> 1. User selects a couple of files, one or more of which contain non-xml chars
> 2. Import fails because of that, complaining about the first file and suggesting
>     to enable the "Remove non-xml chars" option
> 3. User enables the "Remove non-xml chars" option and retries
> 
> What do you think?

+1 for the warning ;) Having the option sounds like a good idea. I guess there should only be very few of these illegal characters, and they typically never occur in a text (control characters, etc.?)
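
For reference, a minimal sketch of what "non-xml chars" could mean here, assuming the Char production of XML 1.0 (tab, LF, CR, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names are hypothetical and not taken from the Cas Editor source:

```java
// Hypothetical helper, not the actual Cas Editor code.
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with every non-XML code point dropped. */
    static String stripNonXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so that valid surrogate pairs are kept intact;
            // an unpaired surrogate is returned as-is and rejected by isXmlChar.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this definition the rejected set is indeed small: most C0 control characters, unpaired surrogates, and U+FFFE/U+FFFF.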

> Maybe we should speak a little about how the import wizard should be.
> The current one can only import plain/text and rtf files. And it supports
> only one view.

One view is fine for me.

> One more restriction we currently have is that it only imports plain/text from
> files which end with .txt (and .rtf). Should we remove this limitation?

How about using TIKA in the importer?

> Do we need to set the language in the wizard?

Would be very nice to have the option.

> Do you think the name "Document" import wizard is fine?

I think that's ok. An audio or video file probably wouldn't be called a document by most people. A Word or PDF file, however, would be, and it can be converted to plain text.

Cheers,

Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckartde@tk.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
------------------------------------------------------------------- 





Re: [jira] [Commented] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Posted by Jörn Kottmann <ko...@gmail.com>.
No, maybe we should make it optional. Then the following would happen:

1. User selects a couple of files, one or more of which contain non-xml chars
2. Import fails because of that, complaining about the first file and suggesting
     to enable the "Remove non-xml chars" option
3. User enables the "Remove non-xml chars" option and retries

What do you think?
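
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: the class name, the method name, and the error text are hypothetical, and the character check assumes the XML 1.0 Char production.

```java
// Hypothetical sketch of the proposed behavior, not the actual wizard code.
final class ImportSketch {

    private ImportSketch() {
    }

    /** XML 1.0 Char production (assumed definition of "xml char"). */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Prepares the text of one file for import. If the option is off, the
     * first non-XML character aborts the import with a message naming the
     * file; if the option is on, such characters are silently dropped.
     */
    static String prepare(String fileName, String text, boolean removeNonXmlChars) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else if (!removeNonXmlChars) {
                throw new IllegalArgumentException(fileName
                        + " contains non-xml character U+" + String.format("%04X", cp)
                        + " at offset " + i
                        + "; enable the \"Remove non-xml chars\" option and retry");
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling `prepare` for each selected file in order would make the import stop at the first offending file, as in step 2.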

Maybe we should speak a little about how the import wizard should be.
The current one can only import plain/text and rtf files. And it supports
only one view.

Besides that, I think (as suggested here on the list) it needs to remember the
last selected encoding and the encodings already entered by the user.

One more restriction we currently have is that it only imports plain/text from
files which end with .txt (and .rtf). Should we remove this limitation?

Do we need to set the language in the wizard?

I think it makes sense to have this wizard focus on importing only
text sofas, since a wizard for binary sofas would be different.

Do you think the name "Document" import wizard is fine?

Jörn

On 7/14/11 4:49 PM, Marshall Schor wrote:
> any warning message given if things are removed (I hope)?
>
> -Marshall
>
> On 7/14/2011 10:40 AM, Jörn Kottmann (JIRA) wrote:
>>      [ https://issues.apache.org/jira/browse/UIMA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065294#comment-13065294 ]
>>
>> Jörn Kottmann commented on UIMA-2165:
>> -------------------------------------
>>
>> Now all non-xml chars are removed.
>>
>>> Cas Editor document import wizard should replace or remove non-xml characters
>>> -----------------------------------------------------------------------------
>>>
>>>                  Key: UIMA-2165
>>>                  URL: https://issues.apache.org/jira/browse/UIMA-2165
>>>              Project: UIMA
>>>           Issue Type: Improvement
>>>           Components: CasEditor
>>>             Reporter: Jörn Kottmann
>>>             Assignee: Jörn Kottmann
>>>             Priority: Minor
>>>              Fix For: 2.3.2SDK
>>>
>>>
>>> When importing a text file which contains non-xml characters the Cas Editor should automatically replace or remove these.
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>


Re: [jira] [Commented] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Posted by Marshall Schor <ms...@schor.com>.
any warning message given if things are removed (I hope)?

-Marshall

On 7/14/2011 10:40 AM, Jörn Kottmann (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/UIMA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065294#comment-13065294 ] 
>
> Jörn Kottmann commented on UIMA-2165:
> -------------------------------------
>
> Now all non-xml chars are removed.
>
>> Cas Editor document import wizard should replace or remove non-xml characters
>> -----------------------------------------------------------------------------
>>
>>                 Key: UIMA-2165
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2165
>>             Project: UIMA
>>          Issue Type: Improvement
>>          Components: CasEditor
>>            Reporter: Jörn Kottmann
>>            Assignee: Jörn Kottmann
>>            Priority: Minor
>>             Fix For: 2.3.2SDK
>>
>>
>> When importing a text file which contains non-xml characters the Cas Editor should automatically replace or remove these.
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>        
>

[jira] [Commented] (UIMA-2165) Cas Editor document import wizard should replace or remove non-xml characters

Posted by "Jörn Kottmann (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065294#comment-13065294 ] 

Jörn Kottmann commented on UIMA-2165:
-------------------------------------

Now all non-xml chars are removed.

> Cas Editor document import wizard should replace or remove non-xml characters
> -----------------------------------------------------------------------------
>
>                 Key: UIMA-2165
>                 URL: https://issues.apache.org/jira/browse/UIMA-2165
>             Project: UIMA
>          Issue Type: Improvement
>          Components: CasEditor
>            Reporter: Jörn Kottmann
>            Assignee: Jörn Kottmann
>            Priority: Minor
>             Fix For: 2.3.2SDK
>
>
> When importing a text file which contains non-xml characters the Cas Editor should automatically replace or remove these.
