Posted to user@uima.apache.org by Mario Gazzo <ma...@gmail.com> on 2015/02/18 21:46:04 UTC

Approach for keeping track of formatting associated with text views

We are starting to use the UIMA framework for NL processing of article text, which is usually stored together with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with plain text, but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to trace every annotation back to the original XML to maintain provenance. Before embarking on this, I would like to validate our approach with more experienced users, since this is the first application we are building with UIMA.

In the first step we would annotate every important element of the XML, including formatting elements in the body. We would maintain some DOM-like relationships between the body text and the formatting annotations, so that the text formatting can later be reproduced alongside the NLP annotations in some article viewer.

Next, in another AE, we would produce a plain-text view from the text annotations in the XML view that need NL analysis. In this new text view we would annotate the different text elements with references back to their counterparts in the original XML view, so that we can trace positions in the original XML and the formatting relations. This will of course require mapping NLP annotation offsets in the text view back to the XML view, but the information should then be there to make that possible.

This approach requires more hand-crafted bookkeeping than we had initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done, and the UIMA docs are vague about managing these kinds of relationships across views. We would therefore really like to know whether there is a simpler and better approach.
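
For concreteness, the kind of cross-view bookkeeping we have in mind looks roughly like the sketch below. The view name "plainText", the type com.example.TextSpan and its xmlSource feature are invented for illustration and would have to be declared in our own type system descriptor; only the plain CAS API is used.

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class PlainTextViewSketch {

    // Create the plain-text view alongside the original XML view (same CAS).
    public static CAS createTextView(CAS xmlView, String plainText) {
        CAS textView = xmlView.createView("plainText");
        textView.setDocumentText(plainText);
        return textView;
    }

    // Add a span annotation in the text view that points back to an annotation
    // in the XML view, so offsets can later be mapped in either direction.
    // com.example.TextSpan and its xmlSource feature are hypothetical and would
    // be declared in our own type system descriptor.
    public static AnnotationFS addMappedSpan(CAS textView, AnnotationFS xmlAnnotation,
            int textBegin, int textEnd) {
        Type spanType = textView.getTypeSystem().getType("com.example.TextSpan");
        Feature sourceFeat = spanType.getFeatureByBaseName("xmlSource");
        AnnotationFS span = textView.createAnnotation(spanType, textBegin, textEnd);
        span.setFeatureValue(sourceFeat, xmlAnnotation);
        textView.addFsToIndexes(span);
        return span;
    }
}

Whether this is the idiomatic way to relate annotations across views is exactly what we would like to validate.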

Any feedback is greatly appreciated. Thanks.

Re: Approach for keeping track of formatting associated with text views

Posted by Peter Klügl <pk...@uni-wuerzburg.de>.
Hi,

the HtmlConverter was built to create an annotated document containing
the plain text of the html or xml source. It is intended to remove all
elements that would not be visible to someone looking at the rendered
html, e.g., in an html browser. Thus, it removes a lot of the document's
text and updates the offsets of the existing annotations. Text in the
head of the html document, for example, is therefore completely removed.
The annotations in those text areas retain no meaningful values for
their begin and end offsets (features). If we want to keep these
annotations, the question arises which offsets they should get.
Normally, I would assume that their begin and end are set to 0. However,
this can be problematic when one wants to apply Ruta rules on the
resulting CAS. (This is probably not really related to the task you want
to solve, but as this is a component of the ruta project, I have to take
care that no problems arise.) Another option is to assign the offsets of
the document annotation. We could also drop the offsets and use plain
feature structures instead of annotations, but that is not what the
corresponding type system intends.
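
For context, the two engines are typically run back to back on the same
CAS. A rough uimaFIT sketch (untested here; it assumes ruta-core and
uimafit are on the classpath, that the Ruta types are picked up by
uimaFIT's automatic type detection, and it leaves all parameters at
their defaults - exact parameter names should be checked against the
Ruta documentation linked in this thread):

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.apache.uima.ruta.engine.HtmlAnnotator;
import org.apache.uima.ruta.engine.HtmlConverter;

public class HtmlPipelineSketch {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        // Toy input document, for illustration only.
        jcas.setDocumentText("<html><body><p>Some <b>formatted</b> text.</p></body></html>");

        // Annotates the html/xml tags in the original view.
        AnalysisEngineDescription annotator =
                AnalysisEngineFactory.createEngineDescription(HtmlAnnotator.class);
        // Creates a new view containing only the plain text, with adapted offsets.
        AnalysisEngineDescription converter =
                AnalysisEngineFactory.createEngineDescription(HtmlConverter.class);

        SimplePipeline.runPipeline(jcas, annotator, converter);
    }
}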

I am not sure that I have fully understood your use case, e.g., the
issue with the sentences. Could you maybe provide a minimal example of
the input and the desired output?

Best,

Peter



Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
Thanks, I can of course open an issue for this.

I have been playing with a modified version of the HtmlConverter, which is why my reply is delayed. I disabled the ‘inBody’ flag inside the HtmlConverterVisitor to get an idea of what the effects might be. It pretty much did what I thought I wanted, except that there are no clear sentence boundaries between many of the metadata strings. Most of them are not really meaningful to NL processing, but there are a few we would want to analyse, and the sentence separation is now gone. I have been looking at some of the conversion and line-break options to get around this but I haven't found a good approach yet. I really only want to introduce some sentence separation like “. ” between the content of different tags outside the body.

I am not sure I understand your offset question. Would you mind elaborating on it for me? Our documents are XML with a single body element containing HTML.
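
What I have in mind for the separators is roughly the following, purely as an illustration outside the converter itself: join the per-element text with “. ” and keep the separators out of the recorded spans, so that they never map back to the XML.

import java.util.List;

public class SeparatorSketch {

    /** A piece of generated plain text together with its offsets in the original XML view. */
    public static class Span {
        public final int textBegin, textEnd; // offsets in the generated plain text
        public final int xmlBegin, xmlEnd;   // offsets in the original XML view
        public Span(int textBegin, int textEnd, int xmlBegin, int xmlEnd) {
            this.textBegin = textBegin;
            this.textEnd = textEnd;
            this.xmlBegin = xmlBegin;
            this.xmlEnd = xmlEnd;
        }
    }

    /** Joins element contents with ". " and records a Span per element, never for the separators. */
    public static String join(List<String> contents, List<int[]> xmlOffsets, List<Span> spansOut) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < contents.size(); i++) {
            if (text.length() > 0) {
                text.append(". "); // separator only; it maps to nothing in the XML
            }
            int begin = text.length();
            text.append(contents.get(i));
            spansOut.add(new Span(begin, text.length(), xmlOffsets.get(i)[0], xmlOffsets.get(i)[1]));
        }
        return text.toString();
    }
}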




Re: Approach for keeping track of formatting associated with text views

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

I implemented something. If you haven't seen it yet, take a look at the
new HtmlConverterXmlTest for examples.

Best,

Peter



Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
No problem. You can contact me anytime in case you have additional questions.



Re: Approach for keeping track of formatting associated with text views

Posted by Peter Klügl <pk...@uni-wuerzburg.de>.
Hi,



thanks for the issue and sorry for the delayed response. I have not yet
found the time to look into it, but I will in the next few days.

Best,

Peter



Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
The issue has now been created:

https://issues.apache.org/jira/browse/UIMA-4286




Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
Thanks, I understand the choices now. I would probably also prefer to use the document annotation if no text content is associated with the tag. Ideally, however, I would prefer that tag annotations get the offsets of the content within their scope, and otherwise the offsets of the content within their closest enclosing ancestor element. Ultimately this could end up being the document annotation. E.g.:

<journal-meta>
    <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
    <journal-title>Environmental Health Perspectives</journal-title>
    <issn pub-type="ppub">0091-6765</issn>
    <publisher>
        <publisher-name>National Institute of Environmental Health Sciences</publisher-name>
    </publisher>
</journal-meta>

Here I would expect journal-meta to have the offsets of all content within its scope, which in the converted view of my experiments gets combined into “Environ Health PerspectEnvironmental Health Perspectives0091-6765National Institute of Environmental Health Sciences”. This works as expected when I simply disable the “inBody” flag of the HtmlConverterVisitor, except that there is no longer any clear separation between the content elements. That is why I would like a sentence separator like “. ” between them, so that I instead get “Environ Health Perspect. Environmental Health Perspectives. 0091-6765. National Institute of Environmental Health Sciences.”. The dot separators should of course not be included in the converter's offsets, since they are not part of the original text.

Additionally there might be a case where a meta tag doesn’t have any content within its scope but it contains attribute values:

<Parent>
	<Child1 attribute="someValue" />
	<Child2>Some content.</Child2>
</Parent>

In this case I would prefer that Child1 gets the same offsets as Child2, since that is the content the tag is most closely related to. If there is no content within the scope of its parent, I would find the first ancestor that does contain content within its scope and use those offsets, although this choice is questionable. I don’t have a good example of this case, though, so I presume it is rare in practice.
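
Roughly what I mean, sketched over a plain DOM tree rather than the converter's own data structures (illustrative only):

import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class OffsetFallbackSketch {

    /** True if the element (or anything below it) contains non-whitespace text. */
    static boolean hasTextContent(Element e) {
        String text = e.getTextContent();
        return text != null && !text.trim().isEmpty();
    }

    /** Returns the element whose converted-text span an empty element should borrow. */
    static Element offsetSource(Element e) {
        if (hasTextContent(e)) {
            return e; // normal case: the element has content of its own
        }
        Node parent = e.getParentNode();
        while (parent instanceof Element) { // walk up to the closest ancestor with content
            Element p = (Element) parent;
            if (hasTextContent(p)) {
                return p; // e.g. Child1 would borrow Parent's span, i.e. Child2's content
            }
            parent = p.getParentNode();
        }
        return e.getOwnerDocument().getDocumentElement(); // ultimately the whole document
    }
}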

That said, the latter is more complicated to implement, so I would be happy if I could simply turn off the “inBody” test in the HtmlConverterVisitor and have some way to add content separation between tags outside the body without resorting to code modifications.

Hope this feedback was helpful.

Your time is much appreciated, thanks.




Re: Approach for keeping track of formatting associated with text views

Posted by Jens Grivolla <j+...@grivolla.net>.
Hi Peter, while I don't think I will be using the HtmlConverter right away,
I would vote for using the length of the document annotation for
annotations that relate to the whole document (such as metadata). That
makes them show up nicely in the CasEditor/Viewer, and you could maintain
them in all segments when you split a CAS (e.g. with something based on
the SimpleTextSegmenter example).
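
Concretely, something like the following, just a sketch with the plain
CAS API; the metadata type is whatever type you already use:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class DocumentSpanSketch {

    // Give a document-level annotation (e.g. metadata) the span of the
    // document annotation, so it covers the whole text in the CAS viewer.
    public static AnnotationFS addDocumentLevel(CAS cas, Type metadataType) {
        AnnotationFS doc = cas.getDocumentAnnotation();
        AnnotationFS meta = cas.createAnnotation(metadataType, doc.getBegin(), doc.getEnd());
        cas.addFsToIndexes(meta);
        return meta;
    }
}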

-- Jens


Re: Approach for keeping track of formatting associated with text views

Posted by Peter Klügl <pk...@uni-wuerzburg.de>.
Hi,

there is no way yet to customize this behavior. The HtmlConverter only 
retains annotations with a length > 0, since annotations with length == 0 
are rather problematic and should be avoided.
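
To illustrate why, here is a tiny self-contained example (uimaFIT is only used
for convenience, and the annotation is a made-up one):

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class ZeroLengthDemo {
  public static void main(String[] args) throws Exception {
    JCas jcas = JCasFactory.createJCas();
    jcas.setDocumentText("Title text and some body text.");

    // A zero-length annotation (begin == end), similar to what an element
    // from the html head would end up with after the conversion.
    Annotation meta = new Annotation(jcas, 5, 5);
    meta.addToIndexes();

    System.out.println("'" + meta.getCoveredText() + "'");              // '' - it covers no text
    System.out.println(JCasUtil.selectCovered(Annotation.class, meta)); // [] - nothing can be covered by it
  }
}

Such annotations are easily lost or ignored by anything that works with
covered text or coverage-based selection.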

I can add a configuration parameter for keeping these annotations if you 
want (it would be best to open an issue for it). What should the offsets of 
the annotations for elements in the head of the html document be? 0, those 
of the first token, or those of the document annotation?

Best,

Peter


Am 06.03.2015 um 14:00 schrieb Mario Gazzo:
> We conducted some experiments with both the HtmlAnnotator and the HtmlConverter but we ran into an issue with the converter. It appears to only convert tag annotations that surround or are inside the body tag. Metadata elements like citations are ignored. The only way to get around this seems to be by forking and modifying the codebase, which I like to avoid. Both modules seem otherwise very useful to us but I am looking for a better approach to solve this issue. Is there some way to customise this behaviour without code modifications?
>
> Your input is appreciated, thanks.
>
>
>> On 18 Feb 2015, at 23:03 , Mario Gazzo <ma...@gmail.com> wrote:
>>
>> Thanks. Looks interesting, seems that it could fit our use case. We will have a closer look at it.
>>
>>> On 18 Feb 2015, at 21:58 , Peter Klügl <pk...@uni-wuerzburg.de> wrote:
>>>
>>> Hi,
>>>
>>> you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator and HtmlConverter [1]
>>>
>>> The former one creates annotations for html elements and therefore also for xml tags. The latter one creates a new view with only the plain text and adds existing annotations while adapting their offsets to the new document.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>
>>> Am 18.02.2015 um 21:46 schrieb Mario Gazzo:
>>>> We are starting to use the UIMA framework for NL processing article text, which is usually stored with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with pure text but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to track every annotation back to the original XML to maintain provenance. Before embarking on this I like to validate our approach with more experienced users since this is the first application we are building with UIMA.
>>>>
>>>> In the first step we would annotate every important element of the XML including formatting elements in the body. We maintain some DOM-like relationships between the body text and formatting annotations so that text formatting can be reproduced later with NLP annotations in some article viewer.
>>>>
>>>> Next we would in another AE produce a pure text view of the text annotations in the XML view that need to be NL analysed. In this new text view we would annotate the different text elements with references back to their counterpart in the original XML view so that we can trace back positions in the original XML and the formatting relations. This of course will require mapping NLP annotation offsets in the text view back to the XML view but the information should then be there to make this possible.
>>>>
>>>> This approach requires somewhat more handcrafted book keeping than we initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done and the UIMA docs are vague regarding managing this kind of relationships across views. We would therefore really like to know if there is a simpler and better approach.
>>>>
>>>> Any feedback is greatly appreciated. Thanks.


Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
We conducted some experiments with both the HtmlAnnotator and the HtmlConverter but we ran into an issue with the converter. It appears to only convert tag annotations that surround or are inside the body tag. Metadata elements like citations are ignored. The only way to get around this seems to be by forking and modifying the codebase, which I would like to avoid. Both modules otherwise seem very useful to us, but I am looking for a better approach to solve this issue. Is there some way to customise this behaviour without code modifications?

Your input is appreciated, thanks.


> On 18 Feb 2015, at 23:03 , Mario Gazzo <ma...@gmail.com> wrote:
> 
> Thanks. Looks interesting, seems that it could fit our use case. We will have a closer look at it.
> 
>> On 18 Feb 2015, at 21:58 , Peter Klügl <pk...@uni-wuerzburg.de> wrote:
>> 
>> Hi,
>> 
>> you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator and HtmlConverter [1]
>> 
>> The former one creates annotations for html elements and therefore also for xml tags. The latter one creates a new view with only the plain text and adds existing annotations while adapting their offsets to the new document.
>> 
>> Best,
>> 
>> Peter
>> 
>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>> 
>> Am 18.02.2015 um 21:46 schrieb Mario Gazzo:
>>> We are starting to use the UIMA framework for NL processing article text, which is usually stored with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with pure text but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to track every annotation back to the original XML to maintain provenance. Before embarking on this I like to validate our approach with more experienced users since this is the first application we are building with UIMA.
>>> 
>>> In the first step we would annotate every important element of the XML including formatting elements in the body. We maintain some DOM-like relationships between the body text and formatting annotations so that text formatting can be reproduced later with NLP annotations in some article viewer.
>>> 
>>> Next we would in another AE produce a pure text view of the text annotations in the XML view that need to be NL analysed. In this new text view we would annotate the different text elements with references back to their counterpart in the original XML view so that we can trace back positions in the original XML and the formatting relations. This of course will require mapping NLP annotation offsets in the text view back to the XML view but the information should then be there to make this possible.
>>> 
>>> This approach requires somewhat more handcrafted book keeping than we initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done and the UIMA docs are vague regarding managing this kind of relationships across views. We would therefore really like to know if there is a simpler and better approach.
>>> 
>>> Any feedback is greatly appreciated. Thanks.
>> 
> 


Re: Approach for keeping track of formatting associated with text views

Posted by Mario Gazzo <ma...@gmail.com>.
Thanks. It looks interesting and seems like it could fit our use case. We will have a closer look at it.

> On 18 Feb 2015, at 21:58 , Peter Klügl <pk...@uni-wuerzburg.de> wrote:
> 
> Hi,
> 
> you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator and HtmlConverter [1]
> 
> The former one creates annotations for html elements and therefore also for xml tags. The latter one creates a new view with only the plain text and adds existing annotations while adapting their offsets to the new document.
> 
> Best,
> 
> Peter
> 
> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
> 
> Am 18.02.2015 um 21:46 schrieb Mario Gazzo:
>> We are starting to use the UIMA framework for NL processing article text, which is usually stored with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with pure text but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to track every annotation back to the original XML to maintain provenance. Before embarking on this I like to validate our approach with more experienced users since this is the first application we are building with UIMA.
>> 
>> In the first step we would annotate every important element of the XML including formatting elements in the body. We maintain some DOM-like relationships between the body text and formatting annotations so that text formatting can be reproduced later with NLP annotations in some article viewer.
>> 
>> Next we would in another AE produce a pure text view of the text annotations in the XML view that need to be NL analysed. In this new text view we would annotate the different text elements with references back to their counterpart in the original XML view so that we can trace back positions in the original XML and the formatting relations. This of course will require mapping NLP annotation offsets in the text view back to the XML view but the information should then be there to make this possible.
>> 
>> This approach requires somewhat more handcrafted book keeping than we initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done and the UIMA docs are vague regarding managing this kind of relationships across views. We would therefore really like to know if there is a simpler and better approach.
>> 
>> Any feedback is greatly appreciated. Thanks.
> 


Re: Approach for keeping track of formatting associated with text views

Posted by Peter Klügl <pk...@uni-wuerzburg.de>.
Hi,

you might want to take a look at two analysis engines of UIMA Ruta: 
HtmlAnnotator and HtmlConverter [1]

The former one creates annotations for html elements and therefore also 
for xml tags. The latter one creates a new view containing only the plain 
text and adds the existing annotations while adapting their offsets to the 
new document.
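
As a quick illustration (not an official example), a minimal uimaFIT sketch
that wires the two engines together might look like this; it assumes
ruta-core and uimafit-core on the classpath and relies on the engines'
default parameters, so the name of the plain-text output view is whatever
the HtmlConverter default is:

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.apache.uima.ruta.engine.HtmlAnnotator;
import org.apache.uima.ruta.engine.HtmlConverter;

public class HtmlToPlainTextSketch {
  public static void main(String[] args) throws Exception {
    // Assumes the Ruta type system is picked up by uimaFIT's automatic
    // type system detection; otherwise it has to be passed explicitly.
    JCas jcas = JCasFactory.createJCas();
    jcas.setDocumentText("<html><body><p>Some <b>article</b> text.</p></body></html>");

    // HtmlAnnotator adds annotations for the html/xml elements in this view;
    // HtmlConverter then creates an additional view containing only the plain
    // text and copies the annotations over with adapted offsets.
    AnalysisEngineDescription annotator =
        AnalysisEngineFactory.createEngineDescription(HtmlAnnotator.class);
    AnalysisEngineDescription converter =
        AnalysisEngineFactory.createEngineDescription(HtmlConverter.class);

    SimplePipeline.runPipeline(jcas, annotator, converter);
  }
}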

Best,

Peter

[1] 
http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html

Am 18.02.2015 um 21:46 schrieb Mario Gazzo:
> We are starting to use the UIMA framework for NL processing article text, which is usually stored with metadata in some XML format. We need to extract text elements to be processed by various NL analysis engines that only work with pure text but we also need to keep track of the formatting information related to the processed text. It is in general also valuable for us to be able to track every annotation back to the original XML to maintain provenance. Before embarking on this I like to validate our approach with more experienced users since this is the first application we are building with UIMA.
>
> In the first step we would annotate every important element of the XML including formatting elements in the body. We maintain some DOM-like relationships between the body text and formatting annotations so that text formatting can be reproduced later with NLP annotations in some article viewer.
>
> Next we would in another AE produce a pure text view of the text annotations in the XML view that need to be NL analysed. In this new text view we would annotate the different text elements with references back to their counterpart in the original XML view so that we can trace back positions in the original XML and the formatting relations. This of course will require mapping NLP annotation offsets in the text view back to the XML view but the information should then be there to make this possible.
>
> This approach requires somewhat more handcrafted book keeping than we initially hoped would be necessary. We haven’t been able to find any examples of how this is usually done and the UIMA docs are vague regarding managing this kind of relationships across views. We would therefore really like to know if there is a simpler and better approach.
>
> Any feedback is greatly appreciated. Thanks.