Posted to user@ctakes.apache.org by Prasanna Bala <ba...@gmail.com> on 2014/07/01 07:54:01 UTC

cTakes Scalability Problem

Hi,

I have a few questions about using third-party libraries with cTakes and
about the run time for processing documents. We are able to run cTakes in
batch mode, but we plan to process 1 million clinical documents. Can
anyone tell me whether they have tackled scalability with cTakes? My idea
is to distribute the processing using Hadoop. There are various libraries
that can run UIMA pipelines on Hadoop. Since cTakes is also built on UIMA,
I think there should be a way to distribute the processing. Has anyone
tried this? Are there any limitations in distributing the processing with
cTakes? Your thoughts, please.

Regards,
Prasanna
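
Whatever framework ends up doing the distribution, the first step is usually to split the document set into disjoint shards, one per worker. The sketch below illustrates only that partitioning step. It does not call cTakes, UIMA, or Hadoop; the function name `shard` and the worker count of 14 are illustrative assumptions, and the integer ids stand in for real document paths.

```python
def shard(items, n_workers):
    """Split items into n_workers disjoint round-robin shards."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Worker i takes items i, i + n_workers, i + 2 * n_workers, ...
    return [items[i::n_workers] for i in range(n_workers)]


# Hypothetical corpus: integer ids standing in for 1 million documents.
doc_ids = list(range(1_000_000))

# 14 is an assumed worker count, chosen only for illustration.
shards = shard(doc_ids, 14)
sizes = [len(s) for s in shards]

# Every document lands in exactly one shard, and shard sizes differ by at most 1.
print(sum(sizes), min(sizes), max(sizes))
```

Round-robin assignment keeps the shards balanced by count; if document lengths vary widely, balancing by total size may work better.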

Re: cTakes Scalability Problem

Posted by Jonathan Bates <jo...@gmail.com>.
Disclaimer: I'm not a developer, just a user.   To use DBConsumer, I had to
change "int" to "bigint" for the anno_base_id in the SQL tables to prevent
overflow in the annotation index.  Speed did increase marginally after the
change, but I don't understand how the change in datatype could have been
the cause...  Let us know how it works out!
-Jon
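
A rough sketch of why an "int" id column can overflow at this scale. It assumes that "int" is a signed 32-bit type (true for several common databases, but not confirmed here for the one in use) and that anno_base_id receives one sequential id per stored annotation, which is also an assumption. The figure of 100 annotations per note is hypothetical.

```python
# Largest value a signed 32-bit integer column can hold (assumed width of "int").
INT32_MAX = 2**31 - 1

# Largest value a signed 64-bit integer column can hold (assumed width of "bigint").
INT64_MAX = 2**63 - 1


def max_annotations_per_note(n_notes, id_max=INT32_MAX):
    """Average annotations per note at which sequential ids reach id_max."""
    return id_max / n_notes


def overflows_int32(n_notes, annotations_per_note):
    """True if one id per annotation would exceed the signed 32-bit range."""
    return n_notes * annotations_per_note > INT32_MAX


notes = 40_000_000  # corpus size mentioned elsewhere in this thread

print(f"32-bit limit: {INT32_MAX:,}")
print(f"break-even: {max_annotations_per_note(notes):.1f} annotations per note")
print(overflows_int32(notes, 100))  # hypothetical 100 annotations per note
```

Under these assumptions the 32-bit range is exhausted once the corpus averages a little under 54 annotations per note, which would explain needing "bigint" for correctness rather than for speed.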



Re: cTakes Scalability Problem

Posted by Prasanna Bala <ba...@gmail.com>.
Hi,

Thanks for your suggestions. So I have to change the "int" to "bigint" to
improve the performance.

I am looking at UIMA DUCC.
http://uima.apache.org/doc-uimaducc-whatitam.html

The problem with Hadoop is that it runs as a batch process, so it cannot be
used for low-latency, real-time systems. But I still want to explore it.



Re: cTakes Scalability Problem

Posted by Jonathan Bates <jo...@gmail.com>.
Hi Prasanna,
I am currently using 3.1.2 to process ~40M notes using 14 CPEs with
AggregatePlaintextUMLSProcessor+DBConsumer.  So far, ~34M notes have been
annotated and stored.  Altogether, I'm seeing 0.054sec/note.  This is with
4.1k rows in v_snomed_fword_lookup.  One modification we had to make was to
change anno_base_id datatype from 'int' to 'bigint'.  It would be very
interesting to see Hadoop used with ctakes...
-Jon
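
A back-of-the-envelope sketch of what the reported 0.054 sec/note implies for total run time. It assumes that figure is the aggregate rate across all 14 CPEs and that throughput stays constant over the whole run; neither is confirmed above. The function names are illustrative.

```python
SECONDS_PER_NOTE = 0.054  # aggregate rate reported above
N_CPES = 14               # number of CPEs reported above


def wall_clock_seconds(n_notes, sec_per_note=SECONDS_PER_NOTE):
    """Estimated wall-clock time, assuming a constant aggregate rate."""
    return n_notes * sec_per_note


def as_hours(seconds):
    return seconds / 3600


def as_days(seconds):
    return seconds / 86400


for label, n in [("1M", 1_000_000), ("40M", 40_000_000)]:
    s = wall_clock_seconds(n)
    print(f"{label} notes: {as_hours(s):.1f} h ({as_days(s):.1f} days)")

# If the rate is aggregate, each CPE spends roughly this long on one note.
print(f"per-CPE time: {SECONDS_PER_NOTE * N_CPES:.3f} sec/note")
```

At that rate, 1 million notes would take about 15 hours and 40 million about 25 days on a comparable setup.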

