Posted to user@uima.apache.org by Henrik Matzen <hu...@googlemail.com> on 2016/07/05 10:02:45 UTC
Serialization NonXML
Hi,
because of the known problem that a CAS cannot be serialized to XMI if it
contains non-XML characters, I tried the code below.
I know it does not work because of this line:
cas = doReplaceNonXml(cas.toString()).toCas;
There is no .toCas method.
Does anyone know how I can solve this?
@Override
public void process(final JCas cas) throws AnalysisEngineProcessException {
    JCas oldcas = cas;
    // Does not compile: String has no toCas method (the problem described above).
    cas = doReplaceNonXml(cas.toString()).toCas;
    try {
        final String xmlContent = this.serializeCas(cas);
        final Map<String, String> metadataFields = this.extractMetadata(xmlContent);
        // Do something with metadataFields
        cas = oldcas;
    } catch (SAXException e) {
        throw new AnalysisEngineProcessException(e);
    } catch (IOException e) {
        throw new AnalysisEngineProcessException(e);
    } catch (ParserConfigurationException e) {
        throw new AnalysisEngineProcessException(e);
    }
}
private String doReplaceNonXml(String aString) {
    char[] buf = aString.toCharArray();
    int pos = XMLUtils.checkForNonXmlCharacters(buf, 0, buf.length, false);
    if (pos == -1) {
        return aString;
    }
    // Replace each non-XML character with a space and continue the scan from there.
    while (pos != -1) {
        buf[pos] = ' ';
        pos = XMLUtils.checkForNonXmlCharacters(buf, pos, buf.length - pos, false);
    }
    return String.valueOf(buf);
}
private String serializeCas(final JCas cas) throws SAXException, IOException {
    // TODO: think about buffering and performance
    final ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
    final XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
    try {
        ser.serialize(cas.getCas(), (new XMLSerializer(out, false)).getContentHandler());
    } finally {
        out.close();
    }
    // Decode explicitly as UTF-8 (assumed to be the serializer's output encoding)
    // so the string matches the UTF-8 bytes read back in extractMetadata.
    return out.toString("UTF-8");
}
private Map<String, String> extractMetadata(final String xmlContent)
        throws SAXException, IOException, ParserConfigurationException {
    final Map<String, String> resultMap = new HashMap<String, String>();
    // parse the xmlContent String with the Java DOM parser
    final DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    final DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    InputStream stream = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8));
    final Document doc = dBuilder.parse(stream);
    // get the metadata field nodes
    final NodeList nl = doc.getElementsByTagName("oze:MetaField");
    if (nl == null) {
        return resultMap;
    }
Best regards!
Re: Serialization NonXML
Posted by Richard Eckart de Castilho <re...@apache.org>.
On 05.07.2016, at 12:02, Henrik Matzen <hu...@googlemail.com> wrote:
>
> because of the known problem that a CAS cannot be serialized to XMI if it
> contains non-XML characters, I tried the code below.
>
> I know it does not work because of this line:
> cas = doReplaceNonXml(cas.toString()).toCas;
> There is no .toCas method.
>
> Does anyone know how I can solve this?
Are you required to use XMI? If not, consider serializing
your CASes in a binary format. [1]
DKPro Core has reader/writer components that support all different kinds of
UIMA binary serialization including some custom variants [2].
Cheers,
-- Richard
[1] https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.compress
[2] https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-BinaryCas
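
If XMI is required after all, one option is to clean the text before it is
set on a CAS rather than trying to convert a string back into a CAS. The
following is only a minimal sketch in plain Java, under the assumption that
the offending characters sit in the document text and may be blanked out.
The class and method names (XmlTextSanitizer, replaceNonXml) are hypothetical
and not part of UIMA. Each illegal UTF-16 code unit is replaced by one space,
so annotation offsets computed on the cleaned text keep the same positions.

```java
// Hypothetical helper, not a UIMA class: blanks out characters that are
// not allowed in XML 1.0 documents.
final class XmlTextSanitizer {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every illegal code point with spaces, one space per UTF-16
    // code unit, so the length of the string does not change.
    static String replaceNonXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                for (int k = 0; k < units; k++) {
                    sb.append(' ');
                }
            }
            i += units;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // NUL and vertical tab are not legal XML 1.0 characters.
        String dirty = "ab\u0000cd\u000Bef";
        System.out.println("[" + replaceNonXml(dirty) + "]"); // prints "[ab cd ef]"
    }
}
```

The cleaned string would then be passed to whatever sets the document text
(for example in the reader), before any annotations are created. String
feature values containing such characters would need the same treatment.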