Posted to dev@ctakes.apache.org by "Mullane, Sean *HS" <SP...@hscmail.mcc.virginia.edu> on 2018/03/01 20:58:48 UTC

BsvRegexSectionizer breaks my pipeline

I am finding that the addition of BsvRegexSectionizer to my pipeline (below) has slowed it basically to a halt. Without the sectionizer added, I get ~1000 documents/minute. With that line added, I ran the pipeline for an hour and got no documents annotated. Can anyone suggest what's going wrong here and how to fix it?

FWIW, I created a .xml descriptor file and tested the sectionizer and .bsv in the CVD, and it worked as expected with only a small-to-moderate decrease in speed.

Thanks,
Sean

//---------------------------------------------------------------------------------------------------------------
// Description: Commands and parameters to create a default plaintext document processing pipeline with UMLS lookup. Used for back-annotation of existing documents. This takes the top x documents not already existing in the ytex.dbo.document table.
//  Database Reader
//  Read documents from a database.
reader org.apache.ctakes.ytex.uima.DBCollectionReader queryGetDocumentKeys="EXECUTE Ytex.Rptg.uspSrc_cTAKES_get_rad_notes_from_batch_backanno /*@pipelineCount*/ _pipelineCount_ ,/*@pipelineNumber*/ _pipelineNumber_", queryGetDocument="EXEC YTEX.Rptg.uspSrc_cTAKES_single_rad_note /*@note_id*/ :instance_id"
// using stored procedures for flexibility and to work around buggy regex in PiperFileReader.java

//  Regex Sectionizer -- added for experiment
//  Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a Bar-Separated-Value (BSV) File.
#   SectionsBsv  path to a BSV file containing a list of regular expressions and corresponding section types.
add org.apache.ctakes.core.ae.BsvRegexSectionizer SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

// Load a simple token processing pipeline from another pipeline file
load DefaultTokenizerPipeline.piper

// Add non-core annotators
add ContextDependentTokenizerAnnotator
addDescription POSTagger

// Add Chunkers
load ChunkerSubPipe.piper

// Default fast dictionary lookup
//add DefaultJCasTermAnnotator
// optional: this may improve recall of low-level concepts
add OverlapJCasTermAnnotator

// Add Cleartk Entity Attribute annotators
load AttributeCleartkSubPipe.piper

// Optional: this may allow ctakes to do better with finding specific forms of generic terms without needing to add all permutations to dictionary
//load RelationSubPipe

//  Database Consumer
//  Writes the extracted annotations to the YTEX database.
add org.apache.ctakes.ytex.uima.annotators.DBConsumer analysisBatch="Radiology_test_DefaultFastPipeline7" storeDocText=false storeCAS=false  typesToIgnore=org.apache.ctakes.typesystem.type.textspan.Sentence,org.apache.ctakes.typesystem.type.syntax.ContractionToken,org.apache.ctakes.typesystem.type.syntax.NewlineToken,org.apache.ctakes.typesystem.type.syntax.NumToken,org.apache.ctakes.typesystem.type.syntax.PunctuationToken,org.apache.ctakes.typesystem.type.syntax.SymbolToken,org.apache.ctakes.typesystem.type.syntax.NP,org.apache.ctakes.typesystem.type.syntax.VP,org.apache.ctakes.typesystem.type.textsem.RomanNumeralAnnotation,org.apache.ctakes.typesystem.type.textsem.PersonTitleAnnotation,org.apache.ctakes.typesystem.type.syntax.WordToken,org.apache.ctakes.typesystem.type.syntax.TreebankNode,org.apache.ctakes.typesystem.type.syntax.TopTreebankNode,org.apache.ctakes.typesystem.type.syntax.TerminalTreebankNode

RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Posted by "Mullane, Sean *HS" <SP...@hscmail.mcc.virginia.edu>.

I have made partial progress. After playing with the CVD descriptor pipeline, it seems that the ExtractionPrepAnnotator is required here. But I'm getting an error similar to Gundolf's:

05 Mar 2018 21:30:19 DEBUG DataBinder - DataBinder requires binding of required fields [classifierJarPath]
05 Mar 2018 21:30:19 ERROR PiperFileRunner - Initialization of annotator class "org.apache.ctakes.clinicalpipeline.ae.ExtractionPrepAnnotator" failed.  (Descriptor: <unknown>)

ctakes-clinical-pipeline-4.0.0.jar is in my classpath so I would think it would be found, but no luck.

Thanks,
Sean

-----Original Message-----
From: Mullane, Sean *HS 
Sent: Monday, March 05, 2018 6:19 PM
To: 'Finan, Sean'; dev@ctakes.apache.org
Subject: RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Sean,

Thanks for looking into this. Somehow the line with the sentence annotator seems to have gotten lost. I rechecked and made sure I had exactly one segment annotator in the pipeline (and a sentence annotator), and it seems that it was able to complete. So that's good!

However, I am getting only null segmentID values for the anno_disease_disorder_mention table (and the other similar tables) in the output database. This may be a DBConsumer-specific issue, as I was able to see segmentID values in the CVD. Still, any suggestions on how to remedy this would be much appreciated. I have been stepping through the code in the Eclipse debugger to try to see what's going on, but it's hard for me to make much sense of it, as I am not particularly familiar with Java.

Thanks,
Sean

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Thursday, March 01, 2018 4:22 PM
To: dev@ctakes.apache.org
Subject: RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Hi Sean,

It looks like you are not using the standard regex file:
> SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

Is it possible that there is a poorly-formed regex?
The sectionizer should time out any regex that takes longer than a few seconds to complete, but it is possible that something in the timeout isn't working.  

As an aside, I don't see a sentence annotator.  A whole lot of downstream annotators depend upon sentences, so you should add one.

Sean
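
The idea of timing out a regex that runs too long, as mentioned in the reply above, can be sketched in plain Java. This is only an illustration under stated assumptions and is not the BsvRegexSectionizer code: the class RegexTimeoutSketch, the wrapper InterruptibleText and the method findWithTimeout are hypothetical names, and the nested-quantifier pattern is just a commonly cited example of a regex that may backtrack heavily (whether it is actually slow depends on the JDK version). Running each line of a custom .bsv file through a guard like this might help spot a poorly-formed expression.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the cTAKES BsvRegexSectionizer code.
final class RegexTimeoutSketch {

    // Wraps the text so that a running match notices a thread interrupt.
    // java.util.regex reads its input through CharSequence.charAt().
    static final class InterruptibleText implements CharSequence {
        private final CharSequence text;

        InterruptibleText(CharSequence text) { this.text = text; }

        @Override public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("regex match interrupted");
            }
            return text.charAt(index);
        }
        @Override public int length() { return text.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new InterruptibleText(text.subSequence(start, end));
        }
        @Override public String toString() { return text.toString(); }
    }

    // Returns Optional.of(found) if find() finished in time, Optional.empty() on timeout.
    static Optional<Boolean> findWithTimeout(Pattern pattern, String text, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Boolean> task = () -> pattern.matcher(new InterruptibleText(text)).find();
        Future<Boolean> result = executor.submit(task);
        try {
            return Optional.of(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; its next charAt() call throws
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for regex", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("regex task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A simple section-header style pattern: completes immediately.
        Pattern header = Pattern.compile("^IMPRESSION:", Pattern.MULTILINE);
        System.out.println(findWithTimeout(header, "FINDINGS: clear\nIMPRESSION: normal", 2000));

        // Nested quantifiers: may finish quickly or hit the timeout, depending on the JDK.
        Pattern nested = Pattern.compile("(a+)+$");
        System.out.println(findWithTimeout(nested, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!", 2000));
    }
}
```

The wrapper is what makes cancellation effective here: Future.cancel(true) only sets the interrupt flag, and a plain String would let the match keep running, so the sketch checks the flag on every character read.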
