Posted to commits@stanbol.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/04/19 14:14:05 UTC

[jira] [Created] (STANBOL-176) NER engine should not put control chars in text literals of the annotation graph

NER engine should not put control chars in text literals of the annotation graph
--------------------------------------------------------------------------------

                 Key: STANBOL-176
                 URL: https://issues.apache.org/jira/browse/STANBOL-176
             Project: Stanbol
          Issue Type: Bug
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


Some text to analyse might contain control chars such as "\x13", "\x14", "\x15"... Such characters cannot be serialized as XML and are generally worthless in the labels and context properties of enhancements.

The NER engine should filter them out before writing its annotations to the content item graph.
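
A minimal sketch of such a filter, assuming it is enough to drop every code point outside the XML 1.0 "Char" production. The class and method names (ControlCharFilter, stripInvalidXmlChars) are hypothetical and are not taken from the actual fix:

```java
// Hypothetical helper, not the code committed for this issue.
final class ControlCharFilter {

    private ControlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text without the code points XML 1.0 cannot carry. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, ControlCharFilter.stripInvalidXmlChars("Paris\u0013 is\u0014 nice\u0015") returns "Paris is nice", while tabs and line breaks are left untouched. Dropping the characters changes the length of the text, so any character offsets stored alongside the literal would have to be computed on the same filtered string.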

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (STANBOL-176) NER engine should not put control chars in text literals of the annotation graph

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier Grisel resolved STANBOL-176.
------------------------------------

    Resolution: Fixed

fixed in r1094969



[jira] [Commented] (STANBOL-176) NER engine should not put control chars in text literals of the annotation graph

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021552#comment-13021552 ] 

Olivier Grisel commented on STANBOL-176:
----------------------------------------

Just a note: "\x13" is written "\u0013" in Java. Such characters can show up in text extracted by libraries such as Apache POI, for instance when extracting the text content of a Word document.
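
For illustration, a small sketch of the Java notation; the class and method names (ControlCharDemo, describe) are made up for this example:

```java
// Hypothetical demo class, only meant to illustrate the Java escape.
final class ControlCharDemo {

    private ControlCharDemo() {}

    /** Formats a char as its code point plus whether Java treats it as an ISO control char. */
    static String describe(char c) {
        return String.format("U+%04X control=%b", (int) c, Character.isISOControl(c));
    }
}
```

Here ControlCharDemo.describe('\u0013') returns "U+0013 control=true".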



[jira] [Commented] (STANBOL-176) NER engine should not put control chars in text literals of the annotation graph

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021548#comment-13021548 ] 

Olivier Grisel commented on STANBOL-176:
----------------------------------------

Sample exception we might get trying to serialize such a "corrupted" literal as RDF/XML:

com.hp.hpl.jena.shared.CannotEncodeCharacterException: cannot encode (char)  in context XML
	at com.hp.hpl.jena.rdf.model.impl.Util.substituteEntitiesInElementContent(Util.java:188)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeLiteral(Basic.java:168)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writePredicate(Basic.java:104)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:77)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:66)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeBody(Basic.java:40)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:500)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:472)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:458)
	at org.apache.clerezza.rdf.jena.serializer.JenaSerializerProvider.serialize(JenaSerializerProvider.java:65)
	at org.apache.clerezza.rdf.core.serializedform.Serializer.serialize(Serializer.java:144)
	at org.apache.stanbol.enhancer.jersey.resource.ContentItemResource.getRdfMetadata(ContentItemResource.java:132)

