Posted to dev@ctakes.apache.org by Tomasz Oliwa <ol...@uchicago.edu> on 2016/09/14 18:08:01 UTC

deserialize and process XCAS files

Hi,

I have working code that deserializes XCAS files and then processes them read-only. It is based on CASConsumerTestDriver.java. An example:

        // inputs: the CAS file and the AE descriptor from cTAKES (placeholders here)
        String xCasLocation = "<location-of-CAS-file>";
        String taeDescriptionLocation = "<location-of-AggregatePlaintextFastUMLSProcessor.xml>";

        // initialize the ae
        AnalysisEngineDescription taeDescription = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(
                new XMLInputSource(new File(taeDescriptionLocation)));
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(taeDescription);

        // read CAS; try-with-resources closes the stream after deserialization
        CAS cas = ae.newCAS();
        try (InputStream xCasStream = new FileInputStream(xCasLocation)) {
            XCASDeserializer.deserialize(xCasStream, cas);
        }

        // print out the Sofa
        System.out.println(cas.getSofaDataString());

        // create jCAS and print out some UmlsConcepts
        JFSIndexRepository indexes = cas.getJCas().getJFSIndexRepository();
        Iterator<?> iterator = indexes.getAnnotationIndex(SignSymptomMention.type).iterator();
        while (iterator.hasNext()) {
            SignSymptomMention annot = (SignSymptomMention) iterator.next();
            System.out.println(annot.getCoveredText());
            // further read the annotation
            FSArray ocArr = annot.getOntologyConceptArr();
            // ...
        }

The code above runs fine, but runs sequentially. I have a lot of CAS files and would like to process them in parallel (for instance to extract some values and store them in another DB).

My question:

Can I pass a reference to the AnalysisEngine ae created above to code that runs in parallel (java.util.concurrent.Callable or parallel Java 8 Streams, it does not matter), provided that I only use read operations (such as annot.getCoveredText() or other calls to get the CUI) and no two threads work on the same CAS?
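
To make the question concrete, here is a minimal sketch of the kind of parallel, read-only processing I have in mind. It uses only the JDK, so the String documents are hypothetical stand-ins for deserialized CAS objects, and ParallelReadOnlySketch, extractTokens and extractAll are made-up names rather than cTAKES or UIMA API. Each Callable works on exactly one document and shares no mutable state with the others:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in: each String "document" plays the role of one deserialized CAS.
final class ParallelReadOnlySketch {

    // Read-only extraction on a single document; touches no shared mutable state.
    static List<String> extractTokens(String document) {
        List<String> tokens = new ArrayList<>();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Submits one Callable per document and collects the results in input order.
    static List<List<String>> extractAll(List<String> documents, int threadCount) {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String document : documents) {
                Callable<List<String>> task = () -> extractTokens(document);
                futures.add(executor.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // waits for that task to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```

Whether the real UIMA and cTAKES objects behave as safely as these plain strings is exactly what I am unsure about.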

I read in https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+Component+Use+Guide that "cTAKES is not designed to be thread safe", but here I would be doing read-only operations to extract concepts and CUIs from JCas objects. No new annotations would be created, no annotators called.

If this is not recommended, what would be the best course of action to deserialize and read-only process these CAS files?

Thanks for any help, I would really appreciate it.
Tomasz

RE: deserialize and process XCAS files

Posted by Tomasz Oliwa <ol...@uchicago.edu>.

Sean,

I tested your code; it works with the small demo CAS files I tried. I replaced the uimaFIT-style AE creation with reading the descriptor XML from a file, as in my example.

I was not aware of the CasPool class; it is nice.

Thanks,
Tomasz

RE: deserialize and process XCAS files

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Tomasz,

As far as I know, the AE is only instantiated to create the type system, which is then handed off. So you shouldn't need to worry about thread-safety problems with the AE. I have an idea for deserializing concurrently, but you'll notice that I did things slightly differently. This is only a sample and I haven't actually done anything with it:




public class ParallelXcasDeserializer {

   static private final Logger LOGGER = Logger.getLogger( "ParallelXcasDeserializer" );
   static private final int PROCESS_COUNT = 10;
   static private final CasPool CAS_POOL = createCasPool( PROCESS_COUNT );

   static private CasPool createCasPool( final int processCount ) {
      try {
         final AnalysisEngineDescription fastPipelineDesc = ClinicalPipelineFactory.getFastPipeline();
         final AnalysisEngine fastPipeline = AnalysisEngineFactory.createEngine( fastPipelineDesc );
         return new CasPool( processCount, fastPipeline );
      } catch ( UIMAException | MalformedURLException multE ) {
         LOGGER.error( multE.getMessage() );
      }
      return null;
   }

   static private final class XCasDeserializer implements Runnable {
      private final String _xcasFilePath;
      private XCasDeserializer( final String xcasFilePath ) {
         _xcasFilePath = xcasFilePath;
      }
      @Override
      public void run() {
         final CAS cas = CAS_POOL.getCas();
         try ( InputStream xCasStream = new FileInputStream( _xcasFilePath ) ) {
            XCASDeserializer.deserialize( xCasStream, cas );
            final JCas jCas = cas.getJCas();
            OntologyConceptUtil.getCuis( jCas ).forEach( LOGGER::info );
         } catch ( IOException | SAXException | CASException multE ) {
            LOGGER.error( multE.getMessage() );
         } finally {
            // return the CAS to the pool even if an unexpected runtime exception is thrown
            CAS_POOL.releaseCas( cas );
         }
      }
   }

   public static void main( final String... args ) {
      if ( CAS_POOL == null ) {
         LOGGER.error( "Couldn't create CAS Pool, exiting" );
         System.exit( 1 );
      }
      final Path inputDir = Paths.get( args[ 0 ] );
      final ExecutorService executor = Executors.newFixedThreadPool( PROCESS_COUNT );
      try ( DirectoryStream<Path> filePaths = Files.newDirectoryStream( inputDir ) ) {
         for ( Path filePath : filePaths ) {
            executor.execute( new XCasDeserializer( filePath.toString() ) );
         }
      } catch ( IOException ioE ) {
         LOGGER.error( ioE.getMessage() );
      }
      executor.shutdown();
   }

}


Throughput may plateau because concurrent I/O can cause disk thrashing, but 10 threads might be an OK start.
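
For what it's worth, the checkout-and-release discipline around the pool can be tried out without UIMA. The sketch below is only an illustration under stated assumptions: SimplePool and PooledWorkSketch are hypothetical names, a StringBuilder stands in for a CAS, and the pool here blocks when empty, which may not match how CasPool behaves. The point is that the instance goes back to the pool in a finally block, so a failing file does not shrink the pool:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a CAS pool; this is not UIMA's CasPool.
final class PooledWorkSketch {

    // A reusable buffer plays the role of a CAS that is reset between uses.
    static final class SimplePool {
        private final BlockingQueue<StringBuilder> free;

        SimplePool(int size) {
            free = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                free.add(new StringBuilder());
            }
        }

        StringBuilder checkOut() throws InterruptedException {
            return free.take(); // blocks until an instance is free
        }

        void release(StringBuilder buffer) {
            buffer.setLength(0); // reset before the next user sees it
            free.add(buffer);
        }

        int available() {
            return free.size();
        }
    }

    // Processes every input on a fixed thread pool; returns how many inputs succeeded.
    static int processAll(List<String> inputs, SimplePool pool, int threadCount) {
        AtomicInteger succeeded = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        for (String input : inputs) {
            executor.execute(() -> {
                StringBuilder buffer = null;
                try {
                    buffer = pool.checkOut();
                    buffer.append(input);
                    if (buffer.length() == 0) {
                        throw new IllegalArgumentException("empty input");
                    }
                    succeeded.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (RuntimeException e) {
                    // a failed item is skipped; the finally block still returns the buffer
                } finally {
                    if (buffer != null) {
                        pool.release(buffer);
                    }
                }
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return succeeded.get();
    }
}
```

Making the pool smaller than the thread count, as in processAll( inputs, new SimplePool( 2 ), 4 ), is one way to cap how many files are being worked on at once if the disk becomes the bottleneck.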

Sean


