Posted to dev@uima.apache.org by "Richard Eckart de Castilho (Jira)" <de...@uima.apache.org> on 2020/10/11 17:17:00 UTC

[jira] [Commented] (UIMA-6266) Clean JSON Wire Format for CAS

    [ https://issues.apache.org/jira/browse/UIMA-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211979#comment-17211979 ] 

Richard Eckart de Castilho commented on UIMA-6266:
--------------------------------------------------

What are the strengths and weaknesses of XMI and of the JSON format that UIMA can currently generate?

What requirements would such a format need to respect?

Here are a few I could think of:

* Should be reasonably usable with plain JSON support (i.e. without having to interpret a ton of auxiliary information encoded in the JSON file)
* Should avoid too much unnecessary redundancy
* Should be interpretable without a separate type system description
* Should allow embedding the full type system declaration
* Should support multiple views
* Should support indexed and non-indexed FSes
* Should support encoding of partial CASes
* Should support encoding data from multiple CASes
* Should have a stable ordering of data to permit easy text-based diffing
* Small changes in the data should lead to small changes in the serialization to permit easy text-based diffing
* Type information should be expressed using JSON types as much as possible (e.g. to represent booleans, strings, integers, etc.)
* Where additional type information is necessary, an embedded (or externally loaded) type system should be consulted
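
To make the stable-ordering and plain-JSON-types requirements above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a proposal for the actual format: the field names ("id", "type", "begin", "end", "stopword") and the helper serialize_stable are hypothetical.

```python
import json


def serialize_stable(feature_structures):
    """Serialize a list of FS dicts so that equal data yields equal text.

    Assumption: every FS carries a unique integer "id" (hypothetical name).
    """
    # Sort the FSes by id and the keys inside each FS alphabetically, so the
    # output does not depend on insertion order and diffs stay small.
    ordered = sorted(feature_structures, key=lambda fs: fs["id"])
    return json.dumps(ordered, sort_keys=True, indent=2)


# The same two FSes in different orders; booleans and integers are plain
# JSON values, so no extra type annotations are needed to read them back.
a = [
    {"id": 2, "type": "Token", "begin": 3, "end": 5, "stopword": False},
    {"id": 1, "type": "Token", "begin": 0, "end": 2, "stopword": True},
]
b = list(reversed(a))

text_a = serialize_stable(a)
text_b = serialize_stable(b)
```

With this approach both inputs serialize to the same text, which is the property that makes line-based diffing practical.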


> Clean JSON Wire Format for CAS
> ------------------------------
>
>                 Key: UIMA-6266
>                 URL: https://issues.apache.org/jira/browse/UIMA-6266
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Core Java Framework
>            Reporter: Daniel Gruhl
>            Priority: Major
>
> A clean format for sending CAS over the wire in JSON would make interoperation with other text analytics systems much easier. Impact on UIMAj would be a need for the serializer and deserializer for these formats.
>  
> The hope would be this is NOT just a cut and paste of the XMI, but rather a clean rethink of what would represent the best wire format going forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Re: [jira] [Commented] (UIMA-6266) Clean JSON Wire Format for CAS

Posted by Daniel Gruhl <da...@gmail.com>.
That looks great! What we've been doing so far is representing a CAS as a
JSON array ([]) of FSes.

An FS looks like {"_type":"Geo", "begin":10, "end":12, "spannedText":"NY",
"lat":40.7128, "lon":-74.006, "fsid":13}

This covers just the FSes, not the indexes yet; I wanted to start the discussion :)
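
As a rough illustration of how a consumer with only plain JSON support might read such an array, here is a small Python sketch. It assumes the field names from the example FS ("_type", "fsid", and so on); the variable names and the lookup-by-id step are hypothetical and not part of any agreed format.

```python
import json

# A CAS as a JSON array of FSes, using the example FS from this message.
cas_json = """[
  {"_type": "Geo", "begin": 10, "end": 12, "spannedText": "NY",
   "lat": 40.7128, "lon": -74.006, "fsid": 13}
]"""

# Plain JSON parsing is enough: numbers and strings come back as native types.
cas = json.loads(cas_json)

# Hypothetical convenience lookups: FS by id, and all FSes of a given type.
by_id = {fs["fsid"]: fs for fs in cas}
geos = [fs for fs in cas if fs["_type"] == "Geo"]
```

No type system description is consulted here; the "_type" field alone is used to select the FSes of interest.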

            -= Dan
