Posted to commits@stanbol.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/04/19 14:16:05 UTC

[jira] [Commented] (STANBOL-176) NER engine should not put control chars in text literals of the annotation graph

    [ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021548#comment-13021548 ] 

Olivier Grisel commented on STANBOL-176:
----------------------------------------

Sample exception we might get when trying to serialize such a "corrupted" literal as RDF/XML:

com.hp.hpl.jena.shared.CannotEncodeCharacterException: cannot encode (char)  in context XML
	at com.hp.hpl.jena.rdf.model.impl.Util.substituteEntitiesInElementContent(Util.java:188)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeLiteral(Basic.java:168)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writePredicate(Basic.java:104)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:77)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:66)
	at com.hp.hpl.jena.xmloutput.impl.Basic.writeBody(Basic.java:40)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:500)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:472)
	at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:458)
	at org.apache.clerezza.rdf.jena.serializer.JenaSerializerProvider.serialize(JenaSerializerProvider.java:65)
	at org.apache.clerezza.rdf.core.serializedform.Serializer.serialize(Serializer.java:144)
	at org.apache.stanbol.enhancer.jersey.resource.ContentItemResource.getRdfMetadata(ContentItemResource.java:132)

> NER engine should not put control chars in text literals of the annotation graph
> --------------------------------------------------------------------------------
>
>                 Key: STANBOL-176
>                 URL: https://issues.apache.org/jira/browse/STANBOL-176
>             Project: Stanbol
>          Issue Type: Bug
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>
> Some text to analyse might contain control chars such as "\x13", "\x14", "\x15"... Such characters cannot be serialized as XML and are generally worthless in the labels and context properties of enhancements.
> The NER engine should filter them out before writing its annotations to the content item graph.
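
The filtering described above could look roughly like the sketch below. It is not the actual Stanbol fix: the class name XmlCharFilter and the method stripInvalidXmlChars are hypothetical, and it assumes that "control chars" means every code point outside the Char production of XML 1.0 (tab, line feed and carriage return are kept). The NER engine would call it on the selected text and context strings before creating the literals.

```java
// Hypothetical helper, not part of Stanbol or Clerezza.
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every code point that XML 1.0
    // cannot represent dropped (control chars, unpaired surrogates, ...).
    static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length();) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For example, XmlCharFilter.stripInvalidXmlChars("Paris\u0013 is") returns "Paris is". Dropping the characters shifts offsets, so if start/end positions are stored alongside the text, replacing each invalid character with a space instead may be the safer choice.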

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira