Posted to user@uima.apache.org by Henrik Matzen <hu...@googlemail.com> on 2016/07/05 10:02:45 UTC

Serialization NonXML

Hi,

because of the known problem that a CAS cannot be serialized to XML if it
contains non-XML characters, I tried this:

I know it's not working because of this line (cas =
doReplaceNonXml(cas.toString()).toCas;):
there is no .toCas method.

Does anyone know how I can solve this?

    @Override
    public void process(final JCas cas) throws AnalysisEngineProcessException {
        JCas oldcas = cas;
        // Does not compile: String has no .toCas, and cas is declared final.
        cas = doReplaceNonXml(cas.toString()).toCas;
        try {
            final String xmlContent = this.serializeCas(cas);
            final Map<String, String> metadataFields = this.extractMetadata(xmlContent);

            //Do something with metadatafields
            cas = oldcas;
        } catch (SAXException e) {
            throw new AnalysisEngineProcessException(e);
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        } catch (ParserConfigurationException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }


    private String doReplaceNonXml(String aString)
    {

        char[] buf = aString.toCharArray();
        int pos = XMLUtils.checkForNonXmlCharacters(buf, 0, buf.length, false);

        if (pos == -1) {
            return aString;
        }

        while (pos != -1) {
            buf[pos] = ' ';
            pos = XMLUtils.checkForNonXmlCharacters(buf, pos, buf.length - pos, false);
        }
        return String.valueOf(buf);
    }

    private String serializeCas(final JCas cas) throws SAXException, IOException {
        // TODO: think about buffering and performance
        final ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
        final XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
        try {
            ser.serialize(cas.getCas(), (new XMLSerializer(out, false)).getContentHandler());
        } finally {
            out.close();
        }
        // Decode as UTF-8 so it matches getBytes(UTF_8) in extractMetadata
        // instead of depending on the platform default charset.
        return out.toString(StandardCharsets.UTF_8.name());
    }

    private Map<String, String> extractMetadata(final String xmlContent)
            throws SAXException, IOException, ParserConfigurationException {

        final Map<String, String> resultMap = new HashMap<String, String>();

        // parse the xmlContent String into a DOM Document
        final DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        final DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        InputStream stream = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8));
        final Document doc = dBuilder.parse(stream);

        // get meta data field node
        final NodeList nl = doc.getElementsByTagName("oze:MetaField");
        if (nl == null) {
            return resultMap;
        }

Best regards!

Re: Serialization NonXML

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 05.07.2016, at 12:02, Henrik Matzen <hu...@googlemail.com> wrote:
> 
> because of the known problem that a CAS cannot be serialized to XML if it
> contains non-XML characters, I tried this:
> 
> I know it's not working because of this line (cas =
> doReplaceNonXml(cas.toString()).toCas;):
> there is no .toCas method.
> 
> Does anyone know how I can solve this?

Are you required to use XMI? If not, consider serializing
your CASes in a binary format. [1]

DKPro Core has reader/writer components that support all different kinds of
UIMA binary serialization including some custom variants [2].
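
If XMI output is required, one possible workaround is to clean the text
before it is put into the CAS, since a String cannot be turned back into a
CAS afterwards. The following is only a minimal sketch of that idea using
the Java standard library, not UIMA's own implementation: the class name
XmlTextSanitizer and both method names are hypothetical, and the valid
ranges are taken from the Char production of XML 1.0. Each invalid
character is replaced by exactly one space, so the string length (and with
it any annotation offsets) stays unchanged.

```java
/** Hypothetical helper, not part of UIMA: replaces characters that XML 1.0 forbids. */
final class XmlTextSanitizer {

    private XmlTextSanitizer() {
    }

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is not a valid XML 1.0 character with spaces. */
    static String replaceNonXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // One space per replaced char keeps all offsets stable.
                // An unpaired surrogate arrives here as a single char.
                for (int k = 0; k < len; k++) {
                    sb.append(' ');
                }
            }
            i += len;
        }
        return sb.toString();
    }
}
```

Under these assumptions, a reader or an early pipeline step could call
jcas.setDocumentText(XmlTextSanitizer.replaceNonXml(rawText)) before any
annotations are created. String feature values that may contain such
characters would need the same treatment.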

Cheers,

-- Richard

[1] https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.compress 
[2] https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-BinaryCas