Posted to user@uima.apache.org by Adam Lally <al...@alum.rpi.edu> on 2007/07/02 18:15:10 UTC

Re: Performance bug in XmiCasSerializer?

Greg,

It doesn't look to me like you're doing anything wrong.

I did a quick test to try to reproduce this but wasn't able to.  I
may need more information about your setup.

I created a CPE with the FileSystemCollectionReader,
PersonTitleAnnotator, and your XmiCasAnnotator.  (I filled in the part
about generating an identifier with something that checks the
SourceDocumentInformation annotations put there by the
FileSystemCollectionReader.)

On a particular set of documents, with the CPE descriptor's
processingUnitThreadCount set to 1 I get a total elapsed time of 9.25
seconds, whereas with the processingUnitThreadCount set to 10 I get a
total elapsed time of 6.875 seconds.  (This is on a dual-core
machine.)

A few questions come to mind:  Are you using a CPE to do the
multithreading or something else?  If something else, do you see the
same behavior if you try using a CPE instead?  Does this only happen
with large documents, and/or does it only happen when you have a lot
of annotations in the CAS?  (I have very few in my test.)
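
If the multithreading is done outside a CPE, a small standalone harness
might help narrow things down by running the same step on 1 thread and
on 10 threads and comparing elapsed time.  The sketch below is only an
illustration: ThreadScalingHarness and timeRun are hypothetical names,
not part of UIMA, and the workload in main is a placeholder to be
replaced with the serialization step under test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical timing harness; not part of UIMA. */
public class ThreadScalingHarness {

    /**
     * Runs taskCount copies of work on a fixed pool of threadCount threads
     * and returns the elapsed wall-clock time in milliseconds.
     */
    public static long timeRun(int threadCount, int taskCount, Runnable work)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        List<Future<?>> futures = new ArrayList<>();
        long start = System.nanoTime();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(work));
            }
            for (Future<?> f : futures) {
                f.get(); // waits for the task and rethrows any failure
            }
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload (assumption): substitute the code being measured.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i % 10);
            }
        };
        System.out.println("1 thread:   " + timeRun(1, 100, work) + " ms");
        System.out.println("10 threads: " + timeRun(10, 100, work) + " ms");
    }
}
```

If the 10-thread run is slower than the 1-thread run for the same
number of tasks, that would point toward contention in the step being
measured rather than in the CPE itself.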

Regards,
  -Adam



On 6/29/07, greg@holmberg.name <ho...@comcast.net> wrote:
> I've run into a severe slowdown when using the XmiCasSerializer in a CasAnnotator with multiple concurrent AnalysisEngines.  I'm wondering if I'm doing something wrong or if there's a bug.  Code for this XMI CasAnnotator is appended.
>
> I run three scenarios on the same set of documents and same set of CAS-updating annotators.
> A. 1 thread/AnalysisEngine with the Xmi CasAnnotator.
> B. 10 threads/AnalysisEngines without the Xmi CasAnnotator (in fact, no saving of any CAS data to disk at all, in any form).
> C. 10 threads/AnalysisEngines with the Xmi CasAnnotator.
>
> On Windows XP I use the excellent ProcExplorer tool from SysInternals.com to measure CPU seconds spent in "user" (i.e. process) space and in kernel space.  I also measure elapsed time.  Here's what I see for these scenarios:
>
> Scenario   User   Kernel   Elapsed
>    A        103        5       588   (a lot of time spent blocking on proprietary remote network services)
>    B         84        4       135
>    C        237      139       295
>
> So, in A, with just one thread and XMI output, we spend very little time in the kernel.
> In B, with 10 threads and no XMI output, we also spend very little time in the kernel.
> C is B+Xmi, and so should be only slightly more than B.  Instead kernel time increases 35X, user time increases 3X, and elapsed time increases 2X.
>
> So it seems like using the XmiCasSerializer with concurrent AnalysisEngines creates some sort of thread contention. Either that, or I'm using it incorrectly.
>
> Is this a bug?
>
>
> Greg Holmberg
>
>
> public class XmiOutputAnnotator extends CasAnnotator_ImplBase {
>
>     public static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";
>
>     private String outputDirectory;
>
>     private XmiCasSerializer serializer;
>
>     @Override
>     public void initialize(UimaContext context) throws ResourceInitializationException {
>         super.initialize(context);
>         outputDirectory = (String) context.getConfigParameterValue(PARAM_OUTPUT_DIRECTORY);
>     }
>
>     public void typeSystemInit(TypeSystem aTypeSystem)
>             throws AnalysisEngineProcessException {
>         serializer = new XmiCasSerializer(aTypeSystem);
>     }
>
>     public void process(CAS cas) throws AnalysisEngineProcessException {
>         JCas base = null;
>         try {
>             base = cas.getJCas();
>         } catch (CASException ce) {
>             throw new AnalysisEngineProcessException(ce);
>         }
>         OutputStream outputStream = null;
>
>         try {
>             String identifier = ...
>
>             File inputFile = new File(new URI(identifier));
>             File outputFile = new File(outputDirectory, inputFile.getAbsolutePath().replace(":", "") + ".xmi");
>             outputFile.getParentFile().mkdirs();
>             outputStream = new FileOutputStream(outputFile);
>             serializer.serialize(cas, new XMLSerializer(outputStream, true).getContentHandler());
>         } catch (Exception e) {
>             throw new MyAnnotatorException(getClass().getSimpleName(), e);
>         } finally {
>             if (outputStream != null) {
>                 try {
>                     outputStream.close();
>                 } catch (IOException e) {
>                     // Ignore?
>                 }
>             }
>         }
>     }
>
> }
>
>
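
The increases quoted for scenario C relative to scenario B follow
directly from the table.  As a quick check, here is a minimal sketch
(ScenarioRatios is a hypothetical helper, not something from the
original message) that divides each C figure by the matching B figure:

```java
/** Hypothetical helper for checking the C-versus-B ratios from the table. */
public class ScenarioRatios {

    /** Returns how many times larger the with-XMI figure is than the without-XMI figure. */
    public static double ratio(double withXmi, double withoutXmi) {
        return withXmi / withoutXmi;
    }

    public static void main(String[] args) {
        // Figures copied from the table: scenario C (with XMI) over scenario B (without).
        System.out.printf("kernel:  %.2fx%n", ratio(139, 4));
        System.out.printf("user:    %.2fx%n", ratio(237, 84));
        System.out.printf("elapsed: %.2fx%n", ratio(295, 135));
    }
}
```

The kernel ratio works out to 34.75, the user ratio to about 2.8, and
the elapsed ratio to about 2.2, which matches the rounded 35X, 3X, and
2X in the message.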