Posted to user@ctakes.apache.org by Yonghui Wu <Yo...@uth.tmc.edu> on 2013/03/19 23:25:40 UTC

How to turn off the tokenizer and sentence boundary module in clinical pipeline?

Hi All,

I'm currently using apache-ctakes-3.0.0-incubating <http://mirror.cogentco.com/pub/apache//incubator/ctakes/apache-ctakes-3.0.0-incubating-bin.tar.gz> with the clinical pipeline AggregatePlaintextUMLSProcessor.xml.

Is there any way to turn off tokenization and sentence boundary detection and force the pipeline to use the default tokens and sentence boundaries, so that we can align the cTAKES output with the original text?

Thanks.

Re: How to turn off the tokenizer and sentence boundary module in clinical pipeline?

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
What do you mean by "default tokens and sentence boundaries"? Does your input data have gold-standard (human-annotated) information about these, or just spaces and newlines?

Many downstream components use the token and sentence types created by the first two components, so if you want to do dictionary lookup you will need those types present somehow. If you have gold-standard information to use, the typical approach is to write a CollectionReader that takes in your gold-standard data as well as the text and creates the Token and Sentence annotations. You could then create a pipeline that is a subset of AggregatePlaintextUMLSProcessor without those two components.
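
As a rough illustration of the offset bookkeeping such a reader would need, here is a minimal Java sketch that maps gold-standard tokens back to character spans in the original text. It deliberately uses no UIMA or cTAKES classes; GoldSpanAligner, Span, alignTokens, and sentenceSpan are hypothetical names invented for this example, and it assumes the gold tokens appear in the text verbatim and in order. A real CollectionReader would take spans like these and create the corresponding Token and Sentence annotations in the CAS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of cTAKES or UIMA: maps gold-standard tokens
// back to character offsets in the original document text.
class GoldSpanAligner {

    // A half-open character span [begin, end) in the original text.
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
        @Override public String toString() { return "[" + begin + "," + end + ")"; }
    }

    // For each gold sentence (a list of tokens), locate every token in the
    // text by scanning forward from the end of the previously matched token.
    static List<List<Span>> alignTokens(String text, List<List<String>> goldSentences) {
        List<List<Span>> result = new ArrayList<>();
        int cursor = 0;
        for (List<String> sentence : goldSentences) {
            List<Span> spans = new ArrayList<>();
            for (String token : sentence) {
                int begin = text.indexOf(token, cursor);
                if (begin < 0) {
                    throw new IllegalArgumentException(
                        "Token not found after offset " + cursor + ": " + token);
                }
                cursor = begin + token.length();
                spans.add(new Span(begin, cursor));
            }
            result.add(spans);
        }
        return result;
    }

    // A sentence span runs from its first token's begin to its last token's end.
    // Assumes the sentence contains at least one token.
    static Span sentenceSpan(List<Span> tokenSpans) {
        return new Span(tokenSpans.get(0).begin,
                        tokenSpans.get(tokenSpans.size() - 1).end);
    }

    public static void main(String[] args) {
        // Made-up sample note and gold annotations, for illustration only.
        String text = "Pt denies chest pain.  No fever.";
        List<List<String>> gold = Arrays.asList(
            Arrays.asList("Pt", "denies", "chest", "pain", "."),
            Arrays.asList("No", "fever", "."));
        for (List<Span> sentence : alignTokens(text, gold)) {
            System.out.println(sentenceSpan(sentence) + " " + sentence);
        }
    }
}
```

Because every span is an offset into the unmodified input, anything the downstream components produce on top of these tokens and sentences stays aligned with the original text.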

If you don't have gold-standard tokens and sentences but think cTAKES is not performing correctly on your data, the best recourse is to write your own tokenizer and sentence detector. If your format is simple enough to handle with a rule-based approach, that may be preferable, because the current models are trained to work on data without particular predictable formatting.
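
To give a sense of how small a rule-based replacement can be for predictably formatted input, here is a sketch in Java under two assumptions that are mine, not from your data: each non-blank line is one sentence, and tokens are separated by whitespace. SimpleRuleSegmenter and its methods are hypothetical names and do not use any cTAKES API; in a real pipeline the same logic would live inside annotators that create the Sentence and Token types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule-based segmenter, not part of cTAKES.
// Assumed rules: one sentence per non-blank line, whitespace-separated tokens.
class SimpleRuleSegmenter {

    // A line's content from its first to its last non-whitespace character.
    static final Pattern SENTENCE = Pattern.compile("\\S(?:[^\\r\\n]*\\S)?");
    // A maximal run of non-whitespace characters.
    static final Pattern TOKEN = Pattern.compile("\\S+");

    // Collect {begin, end} offsets of every match inside text[from, to).
    // Offsets are relative to the whole text, not to the region.
    static List<int[]> find(Pattern pattern, String text, int from, int to) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = pattern.matcher(text).region(from, to);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    static List<int[]> sentences(String text) {
        return find(SENTENCE, text, 0, text.length());
    }

    static List<int[]> tokens(String text, int[] sentence) {
        return find(TOKEN, text, sentence[0], sentence[1]);
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        String text = "BP 120/80\n  No fever. \n\nHR 72";
        for (int[] s : sentences(text)) {
            System.out.println("Sentence " + s[0] + "-" + s[1]);
            for (int[] t : tokens(text, s)) {
                System.out.println("  Token " + t[0] + "-" + t[1]
                    + " '" + text.substring(t[0], t[1]) + "'");
            }
        }
    }
}
```

Note that this keeps punctuation attached to the neighboring word ("fever." is one token), which may or may not match what the dictionary lookup expects; splitting punctuation would need an additional rule.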

Hope this helps,
Tim
