Posted to dev@uima.apache.org by "Steven Bethard (JIRA)" <de...@uima.apache.org> on 2011/03/25 15:17:05 UTC

[jira] [Created] (UIMA-2101) CasToInlineXml adds whitespace

CasToInlineXml adds whitespace
------------------------------

                 Key: UIMA-2101
                 URL: https://issues.apache.org/jira/browse/UIMA-2101
             Project: UIMA
          Issue Type: Bug
    Affects Versions: 2.3.1SDK
            Reporter: Steven Bethard


CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:

<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
    </uima.tcas.DocumentAnnotation>
</Document>

I think it should instead write everything in a single line, that is:

<?xml version="1.0" encoding="UTF-8"?>
<Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>

I believe this could be fixed by replacing the line:

XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);

with the line:

XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);

I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.
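
To illustrate what such a flag would toggle without depending on UIMA, here is a minimal sketch that uses the JDK's identity Transformer as a stand-in for XMLSerializer. The stand-in is an assumption made for illustration: the class name IndentToggleDemo, the method serialize and the sample document are all hypothetical, and the JDK's indentation rules need not match those of the UIMA serializer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of UIMA.
final class IndentToggleDemo {

    /**
     * Re-serializes the given XML with an identity transform.
     * The 'formatted' argument plays the role of the boolean
     * passed to the serializer in the proposal above.
     */
    static String serialize(String xml, boolean formatted) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, formatted ? "yes" : "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String compact = "<Document><a begin=\"0\" end=\"1\"> </a></Document>";
        // Unformatted: no whitespace is added between elements.
        System.out.println(serialize(compact, false));
        // Formatted: the serializer is allowed to insert line breaks and indentation.
        System.out.println(serialize(compact, true));
    }
}
```

With formatting off, the single space inside the inner element stays the only character data in the document, which is what keeps begin="0" end="1" meaningful.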

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by Marshall Schor <ms...@schor.com>.

On 3/29/2011 1:49 AM, Richard Eckart de Castilho wrote:
> Hello Marshall,
>
> in a previous comment on the Jira issue, I stated similar concerns. However, I have to admit that Steven has a point: the class can greatly facilitate getting things done if you employ a reasonably simple type system and are sure that you do not have overlapping annotations. Steven's use case seems to be to import XML data, process it, and export it again.
>
> For somebody familiar with XML, all of the listed points should be acceptable; only the truncation of feature values longer than 64 chars seems a bit arbitrary.
>
> As for the DKPro Core XmlWriterInline: I should document the "inaccuracies" and possibly include some sanity checks that log warnings if a CAS contains overlapping annotations or complex feature structures used as features, just to be sure that novice users are aware that strange things may be happening.

Good idea :-)  -Marshall
> Cheers,
>
> Richard
>
> Am 29.03.2011 um 02:46 schrieb Marshall Schor:
>
>> Just to be sure it's well known:
>>
>> The Javadoc for this class indicates that this code only does an "approximate"
>> representation of things.
>>
>> In particular, it says:
>>
>> * Generates an *approximate* inline XML representation of a CAS.
>> * Annotation types are represented as XML tags, features are represented as
>> attributes.
>> * 
>> * Features whose values are FeatureStructures are not represented.
>> * Feature values which are strings longer than 64 characters are truncated.
>> * Feature values which are arrays of primitives are represented by
>> * strings that look like [ xxx, xxx ]
>> *
>> * The Subject of analysis is presumed to be a text string.
>> *
>> * Some characters in the document's Subject-of-analysis
>> * are replaced by blanks, because the characters aren't valid in xml documents.
>> *
>> * It doesn't work for annotations which are overlapping, because these cannot
>> * be properly represented as properly - nested XML.
>>
>> Because of these "inaccuracies" are you sure you want to be using this class for
>> your projects?
>>
>> -Marshall
>>
>> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>>>     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Richard Eckart de Castilho updated UIMA-2101:
>>> ---------------------------------------------
>>>
>>>    Attachment: UIMA-2101-eckart-20110329.patch
>>>
>>> In addition to being able to disable formatting - as motivated by Steven - I would like to be able to access the SAX events generated from the CAS, so I can use a custom transformer in the DKPro Core component XmlWriterInline.
>>>
>>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of the uimaj-core module.
>>>
>>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, ContentHandler) which allows the user to use a custom transformer or other SAX event handler.
>>> - Added new property outputFormatted controlling whether generated XML strings are formatted or not. This property does not affect the new generateXML(...) method (see above). Per default the property is set to true, resembling the state without the patch.
>>> - Added rudimentary test case to check if (not) formatting works. Code borrows from XmiCasDeserializerTest.
>>> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>>>
>>>
>>>> CasToInlineXml adds whitespace
>>>> ------------------------------
>>>>
>>>>                Key: UIMA-2101
>>>>                URL: https://issues.apache.org/jira/browse/UIMA-2101
>>>>            Project: UIMA
>>>>         Issue Type: Bug
>>>>   Affects Versions: 2.3.1SDK
>>>>           Reporter: Steven Bethard
>>>>        Attachments: UIMA-2101-eckart-20110329.patch
>>>>
>>>>
>>>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document>
>>>>    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>>>>        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>>>>    </uima.tcas.DocumentAnnotation>
>>>> </Document>
>>>> {noformat}
>>>> I think it should instead write everything in a single line, that is:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>>> {noformat}
>>>> I believe this could be fixed by replacing the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>>> {noformat}
>>>> with the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>>> {noformat}
>>>> I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
> Richard Eckart de Castilho
>

Re: [jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by Richard Eckart de Castilho <ec...@tk.informatik.tu-darmstadt.de>.
Hello Marshall,

in a previous comment on the Jira issue, I stated similar concerns. However, I have to admit that Steven has a point: the class can greatly facilitate getting things done if you employ a reasonably simple type system and are sure that you do not have overlapping annotations. Steven's use case seems to be to import XML data, process it, and export it again.

For somebody familiar with XML, all of the listed points should be acceptable; only the truncation of feature values longer than 64 chars seems a bit arbitrary.

As for the DKPro Core XmlWriterInline: I should document the "inaccuracies" and possibly include some sanity checks that log warnings if a CAS contains overlapping annotations or complex feature structures used as features, just to be sure that novice users are aware that strange things may be happening.
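
A sanity check for overlapping annotations could look roughly like the sketch below. It is only an illustration under stated assumptions: annotations are reduced to plain {begin, end} pairs rather than read from a CAS, and the names OverlapCheck and hasCrossingSpans are hypothetical. Two spans "cross" when they overlap without one containing the other, which is exactly the case that cannot be written as properly nested XML.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical helper; spans are {begin, end} pairs with begin <= end.
final class OverlapCheck {

    /** Returns true if any two spans overlap without one containing the other. */
    static boolean hasCrossingSpans(int[][] spans) {
        int[][] sorted = spans.clone();
        // Sort by begin ascending; for equal begins put the longer span first,
        // so that an enclosing span is always visited before the spans it contains.
        Arrays.sort(sorted, (a, b) ->
                a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));

        // Stack of spans that are still "open" at the current position.
        Deque<int[]> open = new ArrayDeque<>();
        for (int[] span : sorted) {
            // Close every span that ends at or before the start of this one.
            while (!open.isEmpty() && open.peek()[1] <= span[0]) {
                open.pop();
            }
            // The innermost open span starts before this one and is still open;
            // if this span ends after it, the two cross.
            if (!open.isEmpty() && span[1] > open.peek()[1]) {
                return true;
            }
            open.push(span);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCrossingSpans(new int[][] {{0, 10}, {2, 5}, {5, 8}}));
        System.out.println(hasCrossingSpans(new int[][] {{0, 5}, {3, 8}}));
    }
}
```

A writer could run such a check before serializing and log a warning instead of silently producing misleading output.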

Cheers,

Richard

Am 29.03.2011 um 02:46 schrieb Marshall Schor:

> Just to be sure it's well known:
> 
> The Javadoc for this class indicates that this code only does an "approximate"
> representation of things.
> 
> In particular, it says:
> 
> * Generates an *approximate* inline XML representation of a CAS.
> * Annotation types are represented as XML tags, features are represented as
> attributes.
> * 
> * Features whose values are FeatureStructures are not represented.
> * Feature values which are strings longer than 64 characters are truncated.
> * Feature values which are arrays of primitives are represented by
> * strings that look like [ xxx, xxx ]
> *
> * The Subject of analysis is presumed to be a text string.
> *
> * Some characters in the document's Subject-of-analysis
> * are replaced by blanks, because the characters aren't valid in xml documents.
> *
> * It doesn't work for annotations which are overlapping, because these cannot
> * be properly represented as properly - nested XML.
> 
> Because of these "inaccuracies" are you sure you want to be using this class for
> your projects?
> 
> -Marshall
> 
> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>>     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>> 
>> Richard Eckart de Castilho updated UIMA-2101:
>> ---------------------------------------------
>> 
>>    Attachment: UIMA-2101-eckart-20110329.patch
>> 
>> In addition to being able to disable formatting - as motivated by Steven - I would like to be able to access the SAX events generated from the CAS, so I can use a custom transformer in the DKPro Core component XmlWriterInline.
>> 
>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of the uimaj-core module.
>> 
>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, ContentHandler) which allows the user to use a custom transformer or other SAX event handler.
>> - Added new property outputFormatted controlling whether generated XML strings are formatted or not. This property does not affect the new generateXML(...) method (see above). Per default the property is set to true, resembling the state without the patch.
>> - Added rudimentary test case to check if (not) formatting works. Code borrows from XmiCasDeserializerTest.
>> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>> 
>> 
>>> CasToInlineXml adds whitespace
>>> ------------------------------
>>> 
>>>                Key: UIMA-2101
>>>                URL: https://issues.apache.org/jira/browse/UIMA-2101
>>>            Project: UIMA
>>>         Issue Type: Bug
>>>   Affects Versions: 2.3.1SDK
>>>           Reporter: Steven Bethard
>>>        Attachments: UIMA-2101-eckart-20110329.patch
>>> 
>>> 
>>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document>
>>>    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>>>        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>>>    </uima.tcas.DocumentAnnotation>
>>> </Document>
>>> {noformat}
>>> I think it should instead write everything in a single line, that is:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>> {noformat}
>>> I believe this could be fixed by replacing the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>> {noformat}
>>> with the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>> {noformat}
>>> I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 

Richard Eckart de Castilho

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
eckartde@tk.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
------------------------------------------------------------------- 






Re: [jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by Marshall Schor <ms...@schor.com>.
Just to be sure it's well known:

The Javadoc for this class indicates that this code only does an "approximate"
representation of things.

In particular, it says:

 * Generates an *approximate* inline XML representation of a CAS.
 * Annotation types are represented as XML tags, features are represented as
attributes.
 * 
 * Features whose values are FeatureStructures are not represented.
 * Feature values which are strings longer than 64 characters are truncated.
 * Feature values which are arrays of primitives are represented by
 * strings that look like [ xxx, xxx ]
 *
 * The Subject of analysis is presumed to be a text string.
 *
 * Some characters in the document's Subject-of-analysis
 * are replaced by blanks, because the characters aren't valid in xml documents.
 *
 * It doesn't work for annotations which are overlapping, because these cannot
 * be properly represented as properly - nested XML.

Because of these "inaccuracies" are you sure you want to be using this class for
your projects?

-Marshall

On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Richard Eckart de Castilho updated UIMA-2101:
> ---------------------------------------------
>
>     Attachment: UIMA-2101-eckart-20110329.patch
>
> In addition to being able to disable formatting - as motivated by Steven - I would like to be able to access the SAX events generated from the CAS, so I can use a custom transformer in the DKPro Core component XmlWriterInline.
>
> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of the uimaj-core module.
>
> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, ContentHandler) which allows the user to use a custom transformer or other SAX event handler.
> - Added new property outputFormatted controlling whether generated XML strings are formatted or not. This property does not affect the new generateXML(...) method (see above). Per default the property is set to true, resembling the state without the patch.
> - Added rudimentary test case to check if (not) formatting works. Code borrows from XmiCasDeserializerTest.
> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>
>
>> CasToInlineXml adds whitespace
>> ------------------------------
>>
>>                 Key: UIMA-2101
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2101
>>             Project: UIMA
>>          Issue Type: Bug
>>    Affects Versions: 2.3.1SDK
>>            Reporter: Steven Bethard
>>         Attachments: UIMA-2101-eckart-20110329.patch
>>
>>
>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document>
>>     <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>>         <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>>     </uima.tcas.DocumentAnnotation>
>> </Document>
>> {noformat}
>> I think it should instead write everything in a single line, that is:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>> {noformat}
>> I believe this could be fixed by replacing the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>> {noformat}
>> with the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>> {noformat}
>> I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>

[jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Eckart de Castilho updated UIMA-2101:
---------------------------------------------

    Attachment: UIMA-2101-eckart-20110329.patch

In addition to being able to disable formatting - as motivated by Steven - I would like to be able to access the SAX events generated from the CAS, so I can use a custom transformer in the DKPro Core component XmlWriterInline.

Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of the uimaj-core module.

- Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, ContentHandler) which allows the user to use a custom transformer or other SAX event handler.
- Added new property outputFormatted controlling whether generated XML strings are formatted or not. This property does not affect the new generateXML(...) method (see above). By default the property is set to true, which matches the behavior without the patch.
- Added rudimentary test case to check whether formatting and non-formatting both work. Code borrows from XmiCasDeserializerTest.
- Auto-formatted using the UIMA Eclipse code profile, which added a few braces.
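
For readers unfamiliar with the pattern, here is a rough sketch of what handing SAX events to a caller-supplied ContentHandler looks like. It does not use the patched CasToInlineXml: emitInline is a hypothetical producer that emits one annotation over the whole text, and RecordingHandler is a hypothetical consumer that simply logs what it receives. Only the org.xml.sax types are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; stands in for a producer that writes to any ContentHandler.
final class SaxEventSketch {

    /** Emits a Document element with one Annotation covering all of 'text'. */
    static void emitInline(String text, ContentHandler handler) throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "begin", "begin", "CDATA", "0");
        attrs.addAttribute("", "end", "end", "CDATA", String.valueOf(text.length()));

        handler.startDocument();
        handler.startElement("", "Document", "Document", new AttributesImpl());
        handler.startElement("", "Annotation", "Annotation", attrs);
        // The document text is passed through untouched; no formatting is involved.
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "Annotation", "Annotation");
        handler.endElement("", "Document", "Document");
        handler.endDocument();
    }

    /** Caller-side handler that records the events it receives. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("start " + qName);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add("end " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text '" + new String(ch, start, length) + "'");
        }
    }

    /** Convenience wrapper: runs the producer against a recording handler. */
    static List<String> record(String text) {
        RecordingHandler handler = new RecordingHandler();
        try {
            emitInline(text, handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(record(" "));
    }
}
```

Because the consumer sees raw events, it decides for itself how (or whether) to format the output, which is why a formatting property would not apply to this path.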



[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Steven Bethard (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011668#comment-13011668 ] 

Steven Bethard commented on UIMA-2101:
--------------------------------------

Perhaps I should clarify my whole use case here. I'm taking the XML output by CasToInlineXml and transforming the element names and attributes so that they match the ISO-TimeML XML format. I'm not doing this for fun - I'm doing it because I have to give someone files in ISO-TimeML and they will only give me ISO-TimeML files back.

So, no, I don't have control over what elements go where or contain what, and no I can't live in XMI land, I have to be able to communicate with the world outside of UIMA. ;-)

All I need is the ability to turn off the extra whitespace to accomplish this.
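
A renaming step of this kind can be sketched with plain JDK classes. This is an illustration under assumptions, not my actual converter: RenameSketch and RenamingFilter are hypothetical names, the mapping from a UIMA type name to TIMEX3 is only an example, element names are assumed to be unprefixed, and the JDK identity Transformer is used for serialization instead of anything from UIMA.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of renaming inline-XML elements without touching the text.
final class RenameSketch {

    /** SAX filter that swaps element names according to a lookup table. */
    static final class RenamingFilter extends XMLFilterImpl {
        private final Map<String, String> renames;

        RenamingFilter(XMLReader parent, Map<String, String> renames) {
            super(parent);
            this.renames = renames;
        }

        private String map(String qName) {
            return renames.getOrDefault(qName, qName);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            String name = map(qName);
            super.startElement(uri, name, name, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            String name = map(qName);
            super.endElement(uri, name, name);
        }
    }

    /** Parses 'xml', renames elements, and serializes the result without indentation. */
    static String rename(String xml, Map<String, String> renames) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = new RenamingFilter(factory.newSAXParser().getXMLReader(), renames);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "no"); // keep character offsets intact
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            t.transform(new SAXSource(reader, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String in = "<Document><uima.tcas.Annotation begin=\"0\" end=\"1\"> "
                + "</uima.tcas.Annotation></Document>";
        System.out.println(rename(in, Collections.singletonMap("uima.tcas.Annotation", "TIMEX3")));
    }
}
```

The renaming itself is harmless; it only works as intended if the input it receives has no extra whitespace to begin with.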


[jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Eckart de Castilho updated UIMA-2101:
---------------------------------------------

    Attachment: UIMA-2101-eckart-20110329.patch

Previously attached wrong file. Now it is the correct one.


[jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Eckart de Castilho updated UIMA-2101:
---------------------------------------------

    Attachment:     (was: UIMA-2101-eckart-20110329.patch)


[jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Steven Bethard (JIRA)" <de...@uima.apache.org>.
     [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Bethard updated UIMA-2101:
---------------------------------

    Description: 
CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:

{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
    </uima.tcas.DocumentAnnotation>
</Document>
{noformat}

I think it should instead write everything in a single line, that is:

{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
{noformat}

I believe this could be fixed by replacing the line:

{noformat}
XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
{noformat}

with the line:

{noformat}
XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
{noformat}

I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.

  was:
CasToInlineXml adds indentation between adjacent XML elements. E.g. for a single character document with a single annotation covering that one character, it will write:

<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
    </uima.tcas.DocumentAnnotation>
</Document>

I think it should instead write everything in a single line, that is:

<?xml version="1.0" encoding="UTF-8"?>
<Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>

I believe this could be fixed by replacing the line:

XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);

with the line:

XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);

I think it's a bug that CasToInlineXml is changing the character offsets, but I would also be happy if there was an alternate constructor or a method on CasToInlineXml that allowed disabling the formatting.



[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Steven Bethard (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011642#comment-13011642 ] 

Steven Bethard commented on UIMA-2101:
--------------------------------------

Basically, I think you should be able to round-trip writing to XML and reading it back again. As it currently stands, the extra whitespace means that if you read the XML back in again, you won't get the original document text set on your Sofa, you'll get a whitespace-mangled version of it.

In the example above, the original DocumentAnnotation contained only a single space. After XML conversion, the DocumentAnnotation contains a newline and nine spaces. And there's no way to figure out which of those spaces were in the original text and which were added by XML conversion.
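
The effect can be reproduced with nothing but the JDK's SAX parser. The following is a minimal sketch, not the UIMA code path: the class name RoundTripDemo and its helper textOf are made up for illustration, and the sofa and language attributes are dropped for brevity. It concatenates all character data from each of the two serialized forms shown in the issue description.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RoundTripDemo {
    // Single-line form: the only character data is the original one-space document.
    public static final String FLAT =
        "<Document><uima.tcas.DocumentAnnotation begin=\"0\" end=\"1\">"
        + "<uima.tcas.Annotation begin=\"0\" end=\"1\"> </uima.tcas.Annotation>"
        + "</uima.tcas.DocumentAnnotation></Document>";

    // Indented form: same elements, plus the line breaks and indentation
    // that a formatting serializer inserts between tags.
    public static final String INDENTED =
        "<Document>\n"
        + "    <uima.tcas.DocumentAnnotation begin=\"0\" end=\"1\">\n"
        + "        <uima.tcas.Annotation begin=\"0\" end=\"1\"> </uima.tcas.Annotation>\n"
        + "    </uima.tcas.DocumentAnnotation>\n"
        + "</Document>";

    /** Concatenates everything the parser reports through characters(). */
    public static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Without a DTD the parser cannot tell indentation from document text,
        // so the two forms yield different strings.
        System.out.println("flat equals original:     " + textOf(FLAT).equals(" "));
        System.out.println("indented equals original: " + textOf(INDENTED).equals(" "));
    }
}
```

With no DTD, the parser has no basis for treating the indentation as anything other than text, so the indented form comes back with the added newlines and spaces mixed into the document text.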


[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011226#comment-13011226 ] 

Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------

What do you mean by "is changing the character offsets"?


[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011643#comment-13011643 ] 

Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------

Actually, that depends on how you treat the XML data: document-oriented or data-oriented. In data-oriented XML, whitespace between two opening or two closing tags is so-called "ignorable whitespace" and may be added or omitted for the sake of readability. Only whitespace between an opening and a closing tag needs to be preserved. If you look at the SAX handler interface, there are two different methods for receiving whitespace: characters() and ignorableWhitespace().

Thus, preserving the content in a round trip depends on what you had in mind when you implemented your parser and serializer. Looks like UIMA has data-oriented XML in mind when serializing. You should only need to respect that when parsing again.
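
As a rough illustration of the two SAX callbacks, here is a small sketch using only the JDK parser. The class name WhitespaceKinds, the Token element and the inline DTD are assumptions made for the example and are not what CasToInlineXml writes. Note that a parser can only report whitespace as ignorable when it knows the content model (for instance from a DTD), and that non-validating parsers are permitted, but not required, to do so. The sketch therefore only counts what arrives through each callback and does not promise a particular split.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceKinds {
    // Hypothetical document: the DTD says Document has element-only content,
    // so the indentation around <Token> is whitespace in element content.
    public static final String XML =
        "<!DOCTYPE Document [\n"
        + "  <!ELEMENT Document (Token)>\n"
        + "  <!ELEMENT Token (#PCDATA)>\n"
        + "]>\n"
        + "<Document>\n"
        + "    <Token> </Token>\n"
        + "</Document>";

    /** Returns { chars seen via characters(), chars seen via ignorableWhitespace() }. */
    public static int[] count(String xml) {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                counts[0] += length;
            }
            @Override
            public void ignorableWhitespace(char[] ch, int start, int length) {
                counts[1] += length;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = count(XML);
        System.out.println("characters():          " + counts[0]);
        System.out.println("ignorableWhitespace(): " + counts[1]);
    }
}
```

The single space inside Token is always delivered through characters(). Whether the six indentation characters arrive through ignorableWhitespace() or characters() depends on the parser, which is why a reader that wants to discard them needs a content model to work from.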



[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Steven Bethard (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011645#comment-13011645 ] 

Steven Bethard commented on UIMA-2101:
--------------------------------------

I'm not sure what you're suggesting. How do you propose that I get back my original document text when parsing?


[jira] [Commented] (UIMA-2101) CasToInlineXml adds whitespace

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.
    [ https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011660#comment-13011660 ] 

Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------

Generally, trying to recover a document including annotations from inlined XML will not work, because the inlined XML data model is less expressive than the CAS data model: inlined XML cannot represent overlapping annotations.

The best option to recover your document from data-oriented XML is to make sure all your text is covered by an annotation (e.g. Token), which should be a leaf in the DOM, and to use the offsets of these annotations to reconstruct the original string. That assumes that there is no text between Tokens, that is, no text between two opening or two closing XML tags.
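
A minimal sketch of that reconstruction, assuming the leaf elements have already been read into (begin, end, text) triples. The class OffsetRebuild and its nested Leaf type are hypothetical helpers for illustration, not UIMA API. Offsets not covered by any leaf cannot be recovered, so the sketch marks them with U+FFFD rather than guessing.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetRebuild {
    /** Hypothetical holder for one leaf annotation read back from the XML. */
    public static final class Leaf {
        final int begin;
        final int end;
        final String coveredText;

        public Leaf(int begin, int end, String coveredText) {
            this.begin = begin;
            this.end = end;
            this.coveredText = coveredText;
        }
    }

    /** Rebuilds the document text by placing each leaf's text at its offsets. */
    public static String rebuild(List<Leaf> leaves) {
        int length = 0;
        for (Leaf leaf : leaves) {
            length = Math.max(length, leaf.end);
        }
        char[] text = new char[length];
        Arrays.fill(text, '\uFFFD'); // placeholder for offsets no leaf covers
        for (Leaf leaf : leaves) {
            if (leaf.end - leaf.begin != leaf.coveredText.length()) {
                throw new IllegalArgumentException(
                    "offsets do not match text length at " + leaf.begin);
            }
            leaf.coveredText.getChars(0, leaf.coveredText.length(), text, leaf.begin);
        }
        return new String(text);
    }

    public static void main(String[] args) {
        List<Leaf> leaves = Arrays.asList(
            new Leaf(0, 5, "Hello"), new Leaf(5, 6, " "), new Leaf(6, 11, "world"));
        System.out.println(rebuild(leaves));
    }
}
```

The length check is what catches whitespace that was added inside a leaf during serialization: if the text read back is longer than end minus begin, the offsets and the text no longer agree.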

Otherwise formatting really needs to be turned off and serialization should happen in such a way that, as you say, offsets are preserved. Again this assumes that there are no overlapping annotations in the CAS. In this case you need to make sure that you do capture ignorable whitespace when parsing the XML.

I tried for a considerable time to implement a system for annotated corpora based on XML as a data model and arrived at the conclusion that it does more harm than good. Today I am happy to use the CAS and its XMI serialization as primary data and serialization models.
