Posted to dev@uima.apache.org by "Marshall Schor (JIRA)" <de...@uima.apache.org> on 2013/08/04 23:35:48 UTC
[jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly

    [ https://issues.apache.org/jira/browse/UIMA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729006#comment-13729006 ] 

Marshall Schor commented on UIMA-3141:
--------------------------------------

I took a look at this, and it may be working as designed.

Here's what appears to be happening (I haven't run the test case yet; I just examined the code):

1) A CAS, sourceCas, is created, having a type system which includes a special type definition, DocMeta, which is a subtype of the built-in uima.tcas.DocumentAnnotation type.  

1a) The code makes an instance of this type, and adds it to the indexes.

2) The sourceCas's "setDocumentLanguage" method is called. This method checks whether there is an indexed instance of the document annotation type (or a subtype of it), and finds the instance of the "DocMeta" type created in 1a); it then sets that instance's language feature to "latin".

3) The new form 6 serializer serializes out the sourceCas, using its type system, so all "indexed" and reachable feature structures are serialized.

4) Now, the interesting part.  This file is deserialized into the targetCas.  However, that CAS has been defined without the special type DocMeta.  With form 6, this type mismatch is allowed, and the semantics are that the deserialization process "filters" the feature structures being deserialized, so that only those with type definitions in the receiving CAS are deserialized, and the others are "skipped".

So - this results in the DocMeta feature structure instance being skipped.

I think this is why the getDocumentLanguage call doesn't return the language that was set on the DocMeta feature structure.

If you put the DocMeta type definition into the targetCas's type system description, does it change the behavior so that getDocumentLanguage returns "latin"?
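
To make the filtering semantics described in 4) concrete, here is a minimal toy model in plain Java. It is only a sketch and does not use the UIMA API: ToyFs, deserializeFiltered, and documentLanguage are hypothetical names introduced for illustration, and the assumption that a missing document annotation yields the "x-unspecified" default is taken from the "Actual" result in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these classes or methods are UIMA APIs.
class TypeFilterSketch {
    static final String DOC_ANNOTATION = "uima.tcas.DocumentAnnotation";
    static final String DEFAULT_LANGUAGE = "x-unspecified";

    // In this toy model, "DocMeta" is treated as a subtype of the document annotation type.
    static final Set<String> DOC_ANNOTATION_FAMILY =
            new HashSet<>(Arrays.asList(DOC_ANNOTATION, "DocMeta"));

    /** Minimal stand-in for a feature structure: a type name plus a language feature. */
    static final class ToyFs {
        final String typeName;
        final String language;

        ToyFs(String typeName, String language) {
            this.typeName = typeName;
            this.language = language;
        }
    }

    /** Keeps only the feature structures whose type the receiving side defines; skips the rest. */
    static List<ToyFs> deserializeFiltered(List<ToyFs> serialized, Set<String> targetTypes) {
        List<ToyFs> kept = new ArrayList<>();
        for (ToyFs fs : serialized) {
            if (targetTypes.contains(fs.typeName)) {
                kept.add(fs);
            }
        }
        return kept;
    }

    /** Returns the language of the first document-annotation-like FS, else the default. */
    static String documentLanguage(List<ToyFs> indexed) {
        for (ToyFs fs : indexed) {
            if (DOC_ANNOTATION_FAMILY.contains(fs.typeName)) {
                return fs.language;
            }
        }
        return DEFAULT_LANGUAGE;
    }

    public static void main(String[] args) {
        // The source side serialized one DocMeta instance with language "latin".
        List<ToyFs> serialized = Arrays.asList(new ToyFs("DocMeta", "latin"));

        // Target type system without DocMeta: the instance is skipped.
        Set<String> withoutDocMeta = new HashSet<>(Arrays.asList(DOC_ANNOTATION));
        System.out.println(documentLanguage(deserializeFiltered(serialized, withoutDocMeta)));

        // Target type system with DocMeta: the instance survives the filter.
        Set<String> withDocMeta = new HashSet<>(Arrays.asList(DOC_ANNOTATION, "DocMeta"));
        System.out.println(documentLanguage(deserializeFiltered(serialized, withDocMeta)));
    }
}
```

In this model the first call prints "x-unspecified" and the second prints "latin". Whether the real form 6 deserializer behaves the same way once DocMeta is added to the target type system is exactly the open question above.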
                
> Binary CAS format 6 + type filtering fails to deserialize document annotation correctly 
> ----------------------------------------------------------------------------------------
>
>                 Key: UIMA-3141
>                 URL: https://issues.apache.org/jira/browse/UIMA-3141
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.4.1SDK
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
>
> When a custom document annotation type is used, the language is not properly restored after deserializing from CAS format 6.
> Expected: deserialized CAS has language "latin"
> Actual: deserialized CAS has language "x-unspecified"
> If the line {{sourceCas.addFsToIndexes(ma);}} is commented out, the code works.
> {code}
> import static org.junit.Assert.assertEquals;
> import static org.junit.Assert.assertTrue;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.InputStream;
> import java.io.OutputStream;
> import org.apache.commons.io.IOUtils;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.cas.impl.Serialization;
> import org.apache.uima.cas.text.AnnotationFS;
> import org.apache.uima.resource.metadata.TypeSystemDescription;
> import org.apache.uima.resource.metadata.impl.TypeSystemDescription_impl;
> import org.apache.uima.util.CasCreationUtils;
> import org.junit.Rule;
> import org.junit.Test;
> import org.junit.rules.TemporaryFolder;
> public class MinimalTest
> {
>     @Rule
>     public TemporaryFolder testFolder = new TemporaryFolder();
>     @Test
>     public void test()
>         throws Exception
>     {
>         TypeSystemDescription sourceTsd = new TypeSystemDescription_impl();
>         sourceTsd.addType("DocMeta", "", CAS.TYPE_NAME_DOCUMENT_ANNOTATION);
>         TypeSystemDescription targetTsd = new TypeSystemDescription_impl();
>         CAS sourceCas = CasCreationUtils.createCas(sourceTsd, null, null);
>         AnnotationFS ma = sourceCas.createAnnotation(sourceCas.getTypeSystem().getType("DocMeta"),
>                 0, 0);
>         sourceCas.addFsToIndexes(ma);
>         sourceCas.setDocumentLanguage("latin");
>         sourceCas.setDocumentText("test");
>         File file = testFolder.newFile("test.bin");
>         OutputStream os = new FileOutputStream(file);
>         Serialization.serializeWithCompression(sourceCas, os, sourceCas.getTypeSystem());
>         IOUtils.closeQuietly(os);
>         assertTrue(new File(testFolder.getRoot(), "test.bin").exists());
>         CAS targetCas = CasCreationUtils.createCas(targetTsd, null, null);
>         InputStream is = new FileInputStream(file);
>         Serialization.deserializeCAS(targetCas, is, sourceCas.getTypeSystem(), null);
>         IOUtils.closeQuietly(is);
>         assertEquals("latin", targetCas.getDocumentLanguage());
>         assertEquals("test", targetCas.getDocumentText());
>     }
> }
> {code}
