Posted to user@ctakes.apache.org by "Petersam, John Contractor" <Jo...@ssa.gov> on 2019/08/05 14:35:47 UTC

RE: [EXTERNAL] Re: Processing Extraordinarily Long Documents

Hi Mike,
Many of the internal cTAKES annotators rely on loops within loops to process documents.  As the document grows, both loops grow, so runtime grows quadratically (or worse) rather than linearly.

A prime example of this is the dependency parser.  We have been able to achieve linear time growth by modifying the underlying code to use tree maps in certain instances and then caching those maps in ThreadLocal variables to take advantage of our multi-threaded platform.
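For anyone curious what this pattern looks like, here is a minimal sketch of the idea John describes. All class and method names are hypothetical (this is not actual cTAKES code): a per-thread TreeMap keyed on token begin-offsets lets an annotator find the token covering an offset in O(log n) via floorEntry, instead of rescanning the token list inside an outer loop.

```java
import java.util.TreeMap;

// Hypothetical sketch: per-thread offset index for O(log n) token lookup.
public class TokenOffsetCache {
    // ThreadLocal gives each pipeline thread its own map, so no locking is needed.
    private static final ThreadLocal<TreeMap<Integer, String>> CACHE =
            ThreadLocal.withInitial(TreeMap::new);

    public static void put(int beginOffset, String token) {
        CACHE.get().put(beginOffset, token);
    }

    // Largest begin-offset <= the query offset, i.e. the token covering it.
    public static String tokenAt(int offset) {
        var entry = CACHE.get().floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    // Clear between documents so stale entries don't leak across notes.
    public static void reset() {
        CACHE.get().clear();
    }
}
```

Building the map once per document is O(n log n); every subsequent lookup is O(log n), which is what turns the nested-loop quadratic behavior into roughly linear growth.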

Hope this helps,
John

From: Debdipto Misra <de...@gmail.com>
Sent: Sunday, August 04, 2019 9:57 PM
To: user@ctakes.apache.org
Cc: Price, Ronald <rp...@luc.edu>; Nathan Salmon <na...@metistream.com>
Subject: [EXTERNAL] Re: Processing Extraordinarily Long Documents

Hi Mike,
Hopefully you have found a solution to the problem by now.
We ran into the same issue while processing clinical notes on Spark and are seeing long task times for a few executors.
Our setup is 400 executors with 8G of executor memory each, using G1GC, with a matching number of data partitions.
Since we were interested in sentence boundaries, divvying the notes into chunks was not an option.

We achieved a performance improvement by switching to an efficient in-memory lookup.
Another issue you might run into is a large number of short-lived objects, which takes a toll on your GC.
So you might want to try reusing some of the cTAKES objects by implementing singleton classes.
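As a rough illustration of that singleton reuse (class names are hypothetical, not a real cTAKES API): construct the expensive resource once per JVM and hand the same instance to every task, instead of allocating a fresh copy per note and flooding the GC.

```java
// Hypothetical sketch: double-checked-locking singleton for an expensive resource.
public class DictionaryHolder {
    // Stand-in for something costly to build, e.g. a dictionary lookup structure.
    public static class Dictionary {
        public final long loadedAt = System.nanoTime();
    }

    private static volatile Dictionary instance;

    public static Dictionary get() {
        if (instance == null) {                      // first check, no lock
            synchronized (DictionaryHolder.class) {
                if (instance == null) {              // second check, under lock
                    instance = new Dictionary();     // constructed exactly once
                }
            }
        }
        return instance;
    }
}
```

The `volatile` field plus double-checked locking keeps the fast path lock-free after initialization, which matters when hundreds of Spark tasks share one executor JVM.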

Hope this has been helpful.
Thanks,
Deb

On Thu, Feb 28, 2019 at 4:23 PM Michael Trepanier <mi...@metistream.com>> wrote:
Hi Ron,

Hugely appreciate the response. Do you know what the max document size you fed through your pipeline was? Below is a line-histogram of our note length vs. processing time (ns). At the lower end, we're seeing a similar drop-off after around 20,000 chars, with a more or less exponential growth in runtime from there on out.

[image.png]
Our current setup is leveraging 256 Spark Executors (essentially JVMs), each with 7G of RAM and 1 core, and then feeding partitions of ~20,000 notes each into these. With this config, we burned through 99% of the notes in less than a day, but ended up spinning on the partitions which contained the larger notes for nearly a week afterwards. For your implementation, could you share the hardware specs and how long it took to process the 84M docs?

Regards,

Mike

On Thu, Feb 28, 2019 at 10:11 AM Price, Ronald <rp...@luc.edu>> wrote:
Mike,
We’ve fully processed 84M documents through cTAKES on 3 separate occasions.  We constructed a pipeline that has 30 separately controlled sub-queues and can target processing of documents to specific queues; we allocate 5-10 of those queues for processing of large documents.  Similar to you, we have a small percentage (3%-4%) of documents that are over 15K characters in size.  The bulk of our documents are less than 3K.  In our environment, through some detailed performance analysis, we determined that the performance breakpoint occurs once documents get above 12K-13K characters.  We also run as many as 10 annotators in a single pass of the corpus.  This approach has worked well for us.
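A minimal sketch of that routing policy, using Ron's numbers (the class and method names are hypothetical; the real system's queue mechanics are surely richer): documents above the ~13K-character breakpoint go round-robin to a reserved block of queues, everything else to the remaining queues.

```java
// Hypothetical sketch: route documents to sub-queues by length.
public class QueueRouter {
    static final int BREAKPOINT_CHARS = 13_000;  // observed performance breakpoint
    static final int TOTAL_QUEUES = 30;
    static final int LARGE_DOC_QUEUES = 10;      // queues 0..9 reserved for large docs

    // docIndex spreads documents round-robin within the chosen queue group.
    static int route(int docLengthChars, int docIndex) {
        if (docLengthChars > BREAKPOINT_CHARS) {
            return docIndex % LARGE_DOC_QUEUES;                                  // 0..9
        }
        return LARGE_DOC_QUEUES + docIndex % (TOTAL_QUEUES - LARGE_DOC_QUEUES);  // 10..29
    }
}
```

The payoff is isolation: a 900K-character note can monopolize one of the large-doc queues for hours without stalling the queues that drain the millions of short notes.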

Thanks,
Ron




From: Michael Trepanier <mi...@metistream.com>>
Date: Thursday, February 28, 2019 at 11:57 AM
To: "user@ctakes.apache.org<ma...@ctakes.apache.org>" <us...@ctakes.apache.org>>
Cc: "Price, Ronald" <rp...@luc.edu>>
Subject: Re: Processing Extraordinarily Long Documents

Hi Dima,

Thanks for the feedback! As our pipeline develops, we'll be building in additional functionality (e.g., temporal relations) that requires context greater than a single sentence. Given this, partitioning on document length and shunting long documents to another queue is an excellent solution.

Thanks,

Mike

On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy <dd...@luc.edu>> wrote:
Hi Mike,

We also observed this issue. Splitting large documents into smaller ones is an option, but you have to make sure you preserve the integrity of individual sentences or you might lose some concept mentions. Since you are using cTAKES only for ontology mapping, I don’t think you need to worry about the integrity of linguistic units larger than a sentence.

FWIW, our solution to this problem was to create a separate queue for large documents and process them independently of the smaller documents.

Best,

Dima



On Feb 27, 2019, at 16:59, Michael Trepanier <mi...@metistream.com>> wrote:

Hi,

We currently have a pipeline which generates ontology mappings for a repository of clinical notes. However, this repository contains documents which, after RTF parsing, can contain over 900,000 characters (albeit a very small percentage of the ~13 million notes; around 50 contain more than 100K chars). Looking at some averages across the dataset, it is clear that processing time grows much faster than linearly with note length:

0-10000 chars: 0.9 seconds (11 million notes)
10000-20000 chars: 5.625 seconds (1.5 million notes)
210000-220000 chars: 4422 seconds/1.22 hours (3 notes)
900000-1000000 chars: 103237 seconds/28.6 hours (1 note)

Given these results, splitting the longer docs into partitions would speed up the pipeline considerably. However, our team has some concerns over how that might impact the context aware steps of the cTAKES pipeline. How would the results from splitting a doc on its sentences or paragraphs compare to feeding in an entire doc? Does the default pipeline API support a way to use segments instead of the entire document text?
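For what it's worth, if sentence-level splitting does turn out to be safe for the annotators in play, one way to divide a long note without cutting a sentence in half is to pack whole sentences into chunks up to a size cap. A rough sketch follows; the regex sentence split is a naive stand-in for a real sentence detector, and the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split text into chunks of at most maxChars,
// breaking only at sentence boundaries so no sentence is cut in half.
public class SentenceChunker {
    static List<String> chunk(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        // Naive split on ., !, ? followed by whitespace; a real pipeline
        // would reuse its own sentence detector instead.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        StringBuilder current = new StringBuilder();
        for (String s : sentences) {
            // Flush the current chunk if adding this sentence would exceed the cap.
            if (current.length() > 0 && current.length() + s.length() + 1 > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(s);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

Each chunk can then be submitted as its own document; mention offsets would need to be shifted back by the chunk's start position when merging results.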

Regards,

Mike




--
Mike Trepanier | Senior Big Data Engineer | MetiStream, Inc. | mike@metistream.com | 845-270-3129 (m) | www.metistream.com




--
Thanks and regards
Deb