Posted to dev@uima.apache.org by "Peter Klügl (JIRA)" <de...@uima.apache.org> on 2015/03/14 14:34:38 UTC

[jira] [Updated] (UIMA-4286) Ruta: HTMLConverter: Option to convert tags outside body tags

     [ https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Klügl updated UIMA-4286:
------------------------------
    Affects Version/s: 2.2.1ruta

> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
>                 Key: UIMA-4286
>                 URL: https://issues.apache.org/jira/browse/UIMA-4286
>             Project: UIMA
>          Issue Type: Improvement
>          Components: ruta
>    Affects Versions: 2.2.1ruta
>            Reporter: Mario Juric
>
> The HTML converter only converts tags that are found inside the body tag. As a result, information-carrying tags such as citations are left out when the converter is applied to XML articles with a lot of metadata. It would be useful to add an option to have all tags converted, since this would allow content outside the body to be parsed by natural language analysers as well.
> As the name implies, the converter was originally conceived for HTML documents, but together with the HTML Annotator it could in this way become more generally useful by enabling NL parsing of a broader class of documents, such as articles stored in XML documents.
> An example of how this option might work can be given by disabling the "inBody" flag inside the HTMLConverterVisitor. The example also illustrates which offsets to apply to such annotations; otherwise the document annotation offsets can be used. Empty tags can still be ignored, but tags with only attributes and no content should preferably be converted.
> Experiments with disabling the "in body" constraint reveal an additional need to separate the content of metadata tags in the converted text view. An NL parser reading the text will in many cases read the content of different tags as one word or one sentence, which is not desirable. A text delimiter should therefore be inserted between tags where required, and it could optionally be customizable as well.
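
The behaviour requested above can be illustrated with a minimal, standalone sketch. It is not Ruta's HTMLConverter or HTMLConverterVisitor; it uses Python's standard html.parser, and the names TagTextConverter, convert, body_only and delimiter are hypothetical, chosen only for this illustration. It assumes a delimiter is wanted at every tag boundary, which is cruder than "where required" (an inline tag in the middle of a word would also be split).

```python
from html.parser import HTMLParser


class TagTextConverter(HTMLParser):
    """Illustrative stand-in for a tag-to-text converter (hypothetical, not Ruta code)."""

    def __init__(self, body_only=True, delimiter=" "):
        super().__init__(convert_charrefs=True)
        self.body_only = body_only    # True mimics the current "in body" constraint
        self.delimiter = delimiter    # inserted between text from different tags
        self.in_body = False
        self.at_boundary = False      # a tag was seen since the last emitted text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        self.at_boundary = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        self.at_boundary = True

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace-only content
        if self.body_only and not self.in_body:
            return  # content outside the body is dropped
        if self.parts and self.at_boundary:
            self.parts.append(self.delimiter)
        self.parts.append(text)
        self.at_boundary = False


def convert(markup, body_only=True, delimiter=" "):
    """Return the plain text of markup under the given options."""
    parser = TagTextConverter(body_only=body_only, delimiter=delimiter)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Made-up sample document with metadata outside the body.
SAMPLE = (
    "<article><front><heading>Deep Parsing</heading>"
    "<citation>Smith 2014</citation></front>"
    "<body><p>Main text.</p></body></article>"
)

if __name__ == "__main__":
    print(convert(SAMPLE))                                    # body only
    print(convert(SAMPLE, body_only=False, delimiter=""))     # tags run together
    print(convert(SAMPLE, body_only=False, delimiter="\n"))   # tags separated
```

With body_only=False and an empty delimiter, the heading and citation text run straight into each other, which is the problem described in the last paragraph of the issue; a newline delimiter keeps them apart for a downstream NL parser. Mapping the converted text back to source offsets is left out of this sketch.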



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)