Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2018/03/05 15:35:58 UTC

RE: Output formats - CPE - cTAKES - Persist in database [EXTERNAL]

Hi Manuel,

The default clinical pipeline runs a piper file located in ctakes-core-res [1].  If you are running from a cTAKES binary build, which is how it looks, you can find the file at:
Resources/org/apache/ctakes/core/pipeline/DefaultFastPipeline.piper

You can edit this file and add a different writer at the end / bottom.  There are a lot of file writers available, more than I have time to fully describe, but below is a partial list.  

pretty.html.HtmlTextWriter
pretty.plaintext.PrettyTextWriterFit
property.plaintext.PropertyTextWriterFit
CuiCountFileWriter
CuiListFileWriter
CuiLookupLister
HtmlTableCasConsumer
SentenceTokensPrinter
TextSpanWriter
TokenFreqCasConsumer
TokenOffsetsCasConsumer

As you have seen, XMI output contains everything under the sun.  The first three writers in the list create output with the information that is most commonly desired (CUIs, negation, uncertainty, etc.).  The rest are more focused in their output.  You can add the whole list to the end of the piper file mentioned above, prefixing each with the "add " command, or just add them individually.  Then make sure that you specify "-o <outputDirectory>" in your command line.  Some of the older writers may not accept -o as a valid parameter, in which case you may need to do something different.  A name ending with "CasConsumer" is a good giveaway that the writer is one of the older types.
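
For illustration, adding the first three writers from the list would mean appending lines like the following to the bottom of DefaultFastPipeline.piper. This is only a sketch of the "add " prefix described above; whether each writer resolves by exactly this short name in a given cTAKES version has not been checked here.

```
add pretty.html.HtmlTextWriter
add pretty.plaintext.PrettyTextWriterFit
add property.plaintext.PropertyTextWriterFit
```

The output directory for these writers is then supplied with -o on the command line.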

There is a JdbcWriterTemplate that was built to write to a database, but it requires a fair amount of configuration.

Sean

[1]  https://cwiki.apache.org/confluence/display/CTAKES/Piper+Files



-----Original Message-----
From: Manuel Lamy [mailto:mmvpdml@gmail.com] 
Sent: Sunday, March 04, 2018 10:29 PM
To: dev@ctakes.apache.org
Subject: Output formats - CPE - cTAKES - Persist in database [EXTERNAL]

Hello everyone,

I'm using cTAKES clinical pipeline in order to process a lot of documents in a row.

I'm using this command in the command line:  runClinicalPipeline.bat  -i input --xmiOut output  --user username  --pass password

This works, adapted to my credentials and my paths of course. My problem is that I can only output in XMI format.

My questions are the following:

-Is it possible to output a different kind of format than XMI? If yes, what should I change in this command and what are the available formats?

-I am interested in persisting the structured clinical information extracted by cTAKES directly in a database. Is there a format that is more suitable for that task? At the moment, I can only output in XMI format. I built a parser in Perl with a lot of regexes in order to process all the information in the XMI file and persist it in a database. However, the XMI file has a complex structure and the script, despite working well, is taking more time than it should to run and persist.

If someone could give me some advice about what my possibilities are, I would appreciate it.

Best regards,

Manuel

Re: Output formats - CPE - cTAKES - Persist in database [EXTERNAL]

Posted by Manuel Lamy <mm...@gmail.com>.

Hello Sean,

Thanks for the quick response, as always. I've tried several of those
writers, and none of them gives me what I need in order to conduct my
research successfully.

What I'm aiming for is an output that is easy to process (the opposite of
the XMI obtained), so that I can then persist it in a database with ease.

What I want to persist in a database is only the diseases, medications,
anatomical regions, clinical procedures and signs/symptoms associated with
each clinical record passed to cTAKES. So clearly I just want the most
standard findings made by cTAKES, nothing out of the ordinary.

Now I have three options that I can think of in order to accomplish the
objective:


   1. Try to work with the JdbcWriterTemplate. Judging by its name, this
   would fit my needs. But from what I've already seen, people usually
   have problems getting it to work properly, since the configuration is not
   straightforward. So I guess this would be a rough path to take. What do
   you think? Read my other two options and maybe you'll understand my doubts.
   2. The second option would be to have an output so straightforward
   that I could build a script with a few regexes in order to obtain the
   clinical entities that I want (listed above). I'm thinking about a txt
   file that would just have something like: "Diseases -> disease a,
   disease b   \n   Medications -> medication a, medication b, etc." This
   way I could just run a script and grab all the clinical entities. The
   processing performance would be much better than with the XMI, since the
   file would have just a few lines with what I want. Of the formats that I
   tried and that worked, none seems easy to process.
   3. This one would probably be rough, but maybe "write my own writer",
   one that would perform as described in point 2.
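
To make option 2 concrete, here is a rough sketch of the kind of script I have in mind, written in Python. Everything in it is an assumption for illustration: the "Category -> a, b" line format is the hypothetical one from point 2 (no cTAKES writer is known to produce it), and the table name, column names and record id are made up. It uses an in-memory SQLite database only so that the example is self-contained.

```python
import sqlite3


def parse_entity_lines(text):
    """Parse lines such as 'Diseases -> a, b' into {category: [entity, ...]}.

    The line format is hypothetical (taken from option 2), not a cTAKES format.
    """
    entities = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank or unrelated lines
        category, _, values = line.partition("->")
        names = [v.strip() for v in values.split(",") if v.strip()]
        entities.setdefault(category.strip(), []).extend(names)
    return entities


def persist(conn, record_id, entities):
    """Store one row per (record, category, entity); returns the row count."""
    # Hypothetical schema, chosen only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS finding "
        "(record_id TEXT, category TEXT, name TEXT)"
    )
    rows = [
        (record_id, category, name)
        for category, names in entities.items()
        for name in names
    ]
    conn.executemany("INSERT INTO finding VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Example input in the hypothetical format from option 2.
sample = (
    "Diseases -> disease a, disease b\n"
    "Medications -> medication a, medication b\n"
)

conn = sqlite3.connect(":memory:")
inserted = persist(conn, "record-1", parse_entity_lines(sample))
```

With a format this simple, splitting on "->" and "," is enough and no regex is needed, which is the main reason this path looks attractive to me.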


So Sean, I'm again in doubt about which path to take. I have thousands of
records coming at me soon and I'll have to make decisions. I hope that, as
always, you can help me take the most efficient path to do the job.

If I'm overestimating the difficulty of getting JdbcWriterTemplate to work,
please tell me. I've had the Dev version of cTAKES for several months now,
so I'm already fairly conversant with the system.

Thanks again!

Best regards,

Manuel

