Posted to dev@ctakes.apache.org by "Petersam, John Contractor" <Jo...@ssa.gov> on 2019/09/25 00:18:26 UTC

RE: [EXTERNAL] Large files taking forever to process

Hi Greg,
We regularly process documents that are over 5000 pages (not lines).  What we've found is that many of the annotators within the standard distribution operate at O(n^2).  The standard dependency parser is one example among many.

The good news is that you can achieve near-linear performance if you convert these classes to use TreeMaps.  We build the TreeMaps once and cache them in ThreadLocal variables, which allows us to process documents on multiple threads simultaneously.
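
To make the pattern concrete, here is a minimal sketch of the idea; the Span and SpanIndex names are purely illustrative, not actual cTAKES classes:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public final class SpanIndex {

    // Minimal stand-in for an annotation with begin/end character offsets.
    public record Span(int begin, int end, String label) {}

    // One TreeMap per worker thread, so concurrent pipelines never
    // contend on (or corrupt) a shared index.
    private static final ThreadLocal<TreeMap<Integer, Span>> INDEX =
            ThreadLocal.withInitial(TreeMap::new);

    // Build the index once per document, O(n log n), instead of
    // re-scanning every annotation for every lookup.  For simplicity
    // this sketch assumes spans have distinct begin offsets.
    public static void build(List<Span> spans) {
        TreeMap<Integer, Span> index = INDEX.get();
        index.clear();
        for (Span s : spans) {
            index.put(s.begin(), s);
        }
    }

    // Return every span that begins inside [begin, end) in
    // O(log n + k) via a sorted submap view.
    public static List<Span> startingWithin(int begin, int end) {
        return new ArrayList<>(INDEX.get().subMap(begin, end).values());
    }
}

The repeated linear scans over all annotations are what produce the quadratic behavior; the subMap range query replaces each scan with a logarithmic lookup.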

Hope this helps,
John

-----Original Message-----
From: Greg Silverman <gm...@umn.edu> 
Sent: Tuesday, September 24, 2019 6:47 PM
To: dev@ctakes.apache.org
Subject: [EXTERNAL] Large files taking forever to process

Any suggestions on how to speed up processing large clinical text notes approaching 13K lines? This is a very old corpus culled from EPIC notes back in 2009. I thought about splitting the notes into smaller chunks, but then I would have to deal with the offsets when analyzing system output against manual annotations that had been done.
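
Roughly, for every chunk I would have to shift the chunk-relative offsets back into the original note's coordinates before scoring; a sketch of that bookkeeping (the names are just for illustration):

import java.util.ArrayList;
import java.util.List;

public final class ChunkOffsets {

    // Chunk-relative annotation, as the pipeline would emit it.
    public record Ann(int begin, int end, String label) {}

    // Shift chunk-relative offsets back into the coordinate space of
    // the original note so they line up with the manual annotations.
    public static List<Ann> toDocumentOffsets(List<Ann> chunkAnns, int chunkStart) {
        List<Ann> shifted = new ArrayList<>(chunkAnns.size());
        for (Ann a : chunkAnns) {
            shifted.add(new Ann(a.begin() + chunkStart,
                                a.end() + chunkStart,
                                a.label()));
        }
        return shifted;
    }
}

Doable, but it is extra bookkeeping on top of everything else.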

As is, I've tried different garbage collection options (this seemed to have worked well with CLAMP on the same set of notes).

TIA!

Greg--

--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu

 ›  evaluate-it.org  ‹

Re: [EXTERNAL] Large files taking forever to process

Posted by Greg Silverman <gm...@umn.edu>.
Sean's fix did the trick.

Thanks for the suggestion, though. I'm wondering how this would work with
our custom implementation of MetaMap with UIMA-AS (it is SLOW as molasses).

Best!

Greg--

On Tue, Sep 24, 2019 at 7:18 PM Petersam, John Contractor <
John.Petersam@ssa.gov> wrote:

> Hi Greg,
> We regularly process documents that are over 5000 pages (not lines).  What
> we've found is that many of the annotators within the standard distribution
> operate at O(n^2).  The standard dependency parser is one example among
> many.
>
> The good news is that you can achieve near-linear performance if you
> convert these classes to use TreeMaps.  We build the TreeMaps once and
> cache them in ThreadLocal variables, which allows us to process documents
> on multiple threads simultaneously.
>
> Hope this helps,
> John
>
> -----Original Message-----
> From: Greg Silverman <gm...@umn.edu>
> Sent: Tuesday, September 24, 2019 6:47 PM
> To: dev@ctakes.apache.org
> Subject: [EXTERNAL] Large files taking forever to process
>
> Any suggestions on how to speed up processing large clinical text notes
> approaching 13K lines? This is a very old corpus culled from EPIC notes
> back in 2009. I thought about splitting the notes into smaller chunks, but
> then I would have to deal with the offsets when analyzing system output
> against manual annotations that had been done.
>
> As is, I've tried different garbage collection options (this seemed to
> have worked well with CLAMP on the same set of notes).
>
> TIA!
>
> Greg--
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> gms@umn.edu
>
>  ›  evaluate-it.org  ‹
>


-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
gms@umn.edu

 ›  evaluate-it.org  ‹