Posted to user@uima.apache.org by Adam Lally <al...@alum.rpi.edu> on 2007/07/02 18:15:10 UTC
Re: Performance bug in XmiCasSerializer?
Greg,
It doesn't look to me like you're doing anything wrong.
I did a quick test to try to reproduce this but wasn't able to. I
may need more information about your setup.
I created a CPE with the FileSystemCollectionReader,
PersonTitleAnnotator, and your XmiCasAnnotator. (I filled in the part
about generating an identifier with something that checks the
SourceDocumentInformation annotations put there by the
FileSystemCollectionReader.)
On a particular set of documents, with the CPE descriptor's
processingUnitThreadCount set to 1 I get a total elapsed time of 9.25
seconds, whereas with the processingUnitThreadCount set to 10 I get a
total elapsed time of 6.875 seconds. (This is on a dual-core
machine.)
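For reference, the ratios implied by these timings and by the table in the quoted message below can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for the example, and the numbers are copied from this thread rather than measured again. By this arithmetic the CPE run above is roughly 1.35x faster with 10 threads.

```java
public class ThreadScalingRatios {

    /** Returns how many times larger 'measured' is than 'baseline'. */
    static double ratio(double measured, double baseline) {
        return measured / baseline;
    }

    public static void main(String[] args) {
        // CPE test from this reply: elapsed seconds with 1 thread vs. 10 threads.
        System.out.printf("CPE speedup, 1 -> 10 threads: %.2fx%n", ratio(9.25, 6.875));

        // Quoted scenarios B (10 threads, no XMI) and C (10 threads, with XMI).
        System.out.printf("user time,    C vs. B: %.1fx%n", ratio(237, 84));
        System.out.printf("kernel time,  C vs. B: %.1fx%n", ratio(139, 4));
        System.out.printf("elapsed time, C vs. B: %.1fx%n", ratio(295, 135));
    }
}
```

The kernel-time ratio is the one that stands out, which is why the questions below focus on how the threads are created and how large the CASes are.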
A few questions come to mind: Are you using a CPE to do the
multithreading or something else? If something else, do you see the
same behavior if you try using a CPE instead? Does this only happen
with large documents, and/or does it only happen when you have a lot
of annotations in the CAS? (I have very few in my test.)
Regards,
-Adam
On 6/29/07, greg@holmberg.name <ho...@comcast.net> wrote:
> I've run into a severe slowdown when using the XmiCasSerializer in a CasAnnotator with multiple concurrent AnalysisEngines. I'm wondering if I'm doing something wrong or if there's a bug. Code for this XMI CasAnnotator is appended.
>
> I run three scenarios on the same set of documents and same set of CAS-updating annotators.
> A. 1 thread/AnalysisEngine with the Xmi CasAnnotator.
> B. 10 threads/AnalysisEngines without the Xmi CasAnnotator (in fact, no saving of any CAS data to disk at all, in any form).
> C. 10 threads/AnalysisEngines with the Xmi CasAnnotator.
>
> On Windows XP I use the excellent ProcExplorer tool from SysInternals.com to measure CPU seconds spent in "user" (i.e. process) space and in kernel space. I also measure elapsed time. Here's what I see for these scenarios:
>
> Scenario   User (s)   Kernel (s)   Elapsed (s)
> A               103            5           588   (a lot of time spent blocking on proprietary remote network services)
> B                84            4           135
> C               237          139           295
>
> So, in A, with just one thread and XMI output, we spend very little time in the kernel.
> In B, with 10 threads and no XMI output, we also spend very little time in the kernel.
> C is B+Xmi, and so should be only slightly more than B. Instead kernel time increases 35X, user time increases 3X, and elapsed time increases 2X.
>
> So it seems like using the XmiCasSerializer with concurrent AnalysisEngines creates some sort of thread contention. Either that, or I'm using it incorrectly.
>
> Is this a bug?
>
>
> Greg Holmberg
>
>
> public class XmiOutputAnnotator extends CasAnnotator_ImplBase {
>
>     public static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";
>
>     private String outputDirectory;
>
>     private XmiCasSerializer serializer;
>
>     @Override
>     public void initialize(UimaContext context) throws ResourceInitializationException {
>         super.initialize(context);
>         outputDirectory = (String) context.getConfigParameterValue(PARAM_OUTPUT_DIRECTORY);
>     }
>
>     public void typeSystemInit(TypeSystem aTypeSystem)
>             throws AnalysisEngineProcessException
>     {
>         serializer = new XmiCasSerializer(aTypeSystem);
>     }
>
>     public void process(CAS cas) throws AnalysisEngineProcessException {
>         JCas base = null;
>         try {
>             base = cas.getJCas();
>         }
>         catch (CASException ce) {
>             throw new AnalysisEngineProcessException(ce);
>         }
>         OutputStream outputStream = null;
>
>         try {
>             String identifier = ...
>
>             File inputFile = new File(new URI(identifier));
>             File outputFile = new File(outputDirectory, inputFile.getAbsolutePath().replace(":", "") + ".xmi");
>             outputFile.getParentFile().mkdirs();
>             outputStream = new FileOutputStream(outputFile);
>             serializer.serialize(cas, new XMLSerializer(outputStream, true).getContentHandler());
>         } catch (Exception e) {
>             throw new MyAnnotatorException(getClass().getSimpleName(), e);
>         } finally {
>             if (outputStream != null) {
>                 try {
>                     outputStream.close();
>                 } catch (IOException e) {
>                     // Ignore?
>                 }
>             }
>         }
>     }
>
> }
>
>