Posted to dev@uima.apache.org by "Richard Eckart de Castilho (Jira)" <de...@uima.apache.org> on 2020/10/13 16:42:00 UTC

[jira] [Comment Edited] (UIMA-6266) Clean JSON Wire Format for CAS

    [ https://issues.apache.org/jira/browse/UIMA-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213212#comment-17213212 ] 

Richard Eckart de Castilho edited comment on UIMA-6266 at 10/13/20, 4:41 PM:
-----------------------------------------------------------------------------

Here are some thoughts:

* {{_type}} is IMHO metadata, but it is also a name that could appear as a field name in a Java class. I'd prefer a naming convention that cannot clash with Java (and maybe other languages) and that might also be used elsewhere in similar JSON formats. E.g. [JSON-LD|https://json-ld.org] prefixes such kinds of JSON fields with an {{@}}.
* We need some kind of mapping from the {{type}} to the fully qualified type name (the one which includes the namespace / Java package). That is also necessary to handle cases where a type with the same base name exists in multiple namespaces / Java packages.
* For lat/long, we know that it is a floating point number, but we don't know which kind. It would be good if the data format (somewhere outside the FS representation) had information on whether this is a float, a double or something else. I think there exist some conventions for encoding field type information in JSON (maybe like {{ { "lat:f": 49.123, "lon:d": -84.234 } }}), but I cannot find anything right now. In any case, having this information in some "schema" part of the file may be preferable to avoid unnecessary redundancy. There is an implicit question here of how much redundancy is acceptable. E.g. for a proper "wire" (network) format, having the feature names repeated in every FS might not be desirable: if the data could be represented as a JSON array, it could be much more compact (assuming that most fields have non-null/default values). But that makes the format more difficult to process. Where is the sweet spot between wire size and ease of access?
* The FSID here seems to be an actual feature of the feature structure (i.e. not "just" metadata used for references between FSes). We'd need an {{@id}} for FSes as well to allow making cross-references between them.
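
To make the trade-off in the points above a bit more concrete, here is a rough Python sketch. It is not a proposal for the actual format. Everything in it is an assumption made up for illustration: the {{@id}} / {{@type}} / {{@name}} keys, the layout of the schema section, the type name {{org.example.geo.Location}}, the sample values and the two helper functions. It shows a verbose object-per-FS encoding next to a compact array-per-FS encoding that relies on a schema declared once.

```python
import json

# Hypothetical schema section: maps a short type alias to its fully
# qualified name and declares each feature's primitive type exactly once.
SCHEMA = {
    "Location": {
        "@name": "org.example.geo.Location",
        "features": [["lat", "float"], ["lon", "double"]],
    }
}

# Verbose form: every FS repeats its feature names.
# Keys prefixed with "@" are metadata, not features of the FS.
verbose = [
    {"@id": 1, "@type": "Location", "lat": 49.123, "lon": -84.234},
    {"@id": 2, "@type": "Location", "lat": 48.5, "lon": 9.05},
]


def to_compact(fs_list, schema):
    """Encode each FS as an array: [@id, @type, values in schema order]."""
    rows = []
    for fs in fs_list:
        features = schema[fs["@type"]]["features"]
        rows.append([fs["@id"], fs["@type"]] + [fs.get(name) for name, _ in features])
    return rows


def from_compact(rows, schema):
    """Rebuild the verbose form; this only works with the schema at hand."""
    fs_list = []
    for fs_id, fs_type, *values in rows:
        names = [name for name, _ in schema[fs_type]["features"]]
        fs = {"@id": fs_id, "@type": fs_type}
        fs.update(zip(names, values))
        fs_list.append(fs)
    return fs_list


compact = to_compact(verbose, SCHEMA)
verbose_size = len(json.dumps(verbose))
compact_size = len(json.dumps(compact))
print("verbose:", verbose_size, "chars; compact:", compact_size, "chars")
```

The compact form drops the repeated feature names, but a consumer can no longer read a single FS without first looking up its type in the schema, which is the ease-of-access cost mentioned above.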



> Clean JSON Wire Format for CAS
> ------------------------------
>
>                 Key: UIMA-6266
>                 URL: https://issues.apache.org/jira/browse/UIMA-6266
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Core Java Framework
>            Reporter: Daniel Gruhl
>            Priority: Major
>
> A clean format for sending CAS over the wire in JSON would make interoperation with other text analytics systems much easier. Impact on UIMAj would be a need for the serializer and deserializer for these formats.
>  
> The hope would be that this is NOT just a cut and paste of the XMI, but rather a clean rethink of what would represent the best wire format going forward.


