Posted to commits@stanbol.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/04/19 14:14:05 UTC
[jira] [Created] (STANBOL-176) NER engine should not put control
chars in text literals of the annotation graph
NER engine should not put control chars in text literals of the annotation graph
--------------------------------------------------------------------------------
Key: STANBOL-176
URL: https://issues.apache.org/jira/browse/STANBOL-176
Project: Stanbol
Issue Type: Bug
Reporter: Olivier Grisel
Assignee: Olivier Grisel
Some text to analyse might contain control chars such as "\x13", "\x14", "\x15"... Such characters cannot be serialized as XML and are generally worthless in the labels and context properties of enhancements.
The NER engine should filter them out before writing its annotations to the content item graph.
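A minimal sketch of such a filter, for illustration only: it is not the change committed for this issue. The class and method names (ControlCharFilter, stripControlChars) are hypothetical, and the set of characters kept is an assumption based on the Char production of XML 1.0, which allows tab, line feed and carriage return but no other C0 control characters.

```java
// Hypothetical helper, not part of the Stanbol code base.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    /**
     * Returns a copy of the text without the code points that XML 1.0
     * cannot represent. Tab, line feed and carriage return are kept.
     */
    static String stripControlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length();) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                cleaned.appendCodePoint(cp);
            }
        }
        return cleaned.toString();
    }

    /** Implements the Char production of the XML 1.0 specification. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

An engine could pass each label or context string through stripControlChars before creating the corresponding literal. Note that removing characters changes string offsets, so any start/end positions stored next to the text would need to be computed consistently with the filtered or the unfiltered text.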
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (STANBOL-176) NER engine should not put control
chars in text literals of the annotation graph
Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Grisel resolved STANBOL-176.
------------------------------------
Resolution: Fixed
fixed in r1094969
[jira] [Commented] (STANBOL-176) NER engine should not put control
chars in text literals of the annotation graph
Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021552#comment-13021552 ]
Olivier Grisel commented on STANBOL-176:
----------------------------------------
Just a note: "\x13" is written "\u0013" in Java. Such characters can show up in the output of libraries such as Apache POI, for instance when extracting the text content of a Word document.
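To check whether extracted text carries such characters, something along these lines could be used. This is a sketch only: the class and method names are hypothetical, and it relies solely on the standard Character.isISOControl method, skipping tab, line feed and carriage return because XML can represent them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Stanbol or Apache POI.
final class ControlCharReport {

    private ControlCharReport() {
    }

    /**
     * Lists each control character found in the text as "escape@index",
     * where the escape is the four-digit Java notation of the character.
     */
    static List<String> findControlChars(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean harmless = c == '\t' || c == '\n' || c == '\r';
            if (Character.isISOControl(c) && !harmless) {
                found.add(String.format("\\u%04X@%d", (int) c, i));
            }
        }
        return found;
    }
}
```

An empty list means the text contains no control characters other than the three whitespace ones.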
[jira] [Commented] (STANBOL-176) NER engine should not put control
chars in text literals of the annotation graph
Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021548#comment-13021548 ]
Olivier Grisel commented on STANBOL-176:
----------------------------------------
Sample exception we might get trying to serialize such a "corrupted" literal as RDF/XML:
com.hp.hpl.jena.shared.CannotEncodeCharacterException: cannot encode (char) in context XML
at com.hp.hpl.jena.rdf.model.impl.Util.substituteEntitiesInElementContent(Util.java:188)
at com.hp.hpl.jena.xmloutput.impl.Basic.writeLiteral(Basic.java:168)
at com.hp.hpl.jena.xmloutput.impl.Basic.writePredicate(Basic.java:104)
at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:77)
at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:66)
at com.hp.hpl.jena.xmloutput.impl.Basic.writeBody(Basic.java:40)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:500)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:472)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:458)
at org.apache.clerezza.rdf.jena.serializer.JenaSerializerProvider.serialize(JenaSerializerProvider.java:65)
at org.apache.clerezza.rdf.core.serializedform.Serializer.serialize(Serializer.java:144)
at org.apache.stanbol.enhancer.jersey.resource.ContentItemResource.getRdfMetadata(ContentItemResource.java:132)