
Posted to dev@ctakes.apache.org by Tim Miller <ti...@childrens.harvard.edu> on 2013/11/13 00:16:35 UTC

getContextMap() question

I'm running the default pipeline on some large files and trying to fix 
some of the slower annotators. I changed ChunkAdjuster to use uimaFIT 
selectors, which dramatically improves speed on large files. I removed 
the OverlapAnnotator, with its complicated interface and extreme 
generality, from my pipeline altogether and replaced it with a 3-line 
static annotator. I think we should consider doing that for the default 
pipeline even if we think there are good reasons to keep the 
general-purpose annotator around.

Anyway, now I'm at the dictionary lookup, which I suspect will be the 
slowest component. One call is to getContextMap(), which seems especially 
slow. It is called for every LookupWindow and, given the span of that 
window, iterates over all LookupWindows looking for one with the 
equivalent span. So in the end you give it a lookup window and it 
basically gives you the same one back. Of course, the code is written very 
generally, so there may be use cases where the types are different, but 
for the default case it seems a little weird for something that does 
nothing to take so long.
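
To make the cost concrete, here is a minimal sketch of the pattern as I understand it, next to one possible fix. This is not the actual cTAKES code: Span, ContextLookupSketch, and all the method names are hypothetical stand-ins, and I'm assuming the only thing that matters for the match is the begin/end offsets. The linear version rescans every context annotation on each call, so calling it once per window is quadratic in the number of windows. The indexed version builds a map keyed by span once per document and then answers each call with a single hash lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContextLookupSketch {
    // Hypothetical stand-in for a UIMA annotation: only the type name and offsets matter here.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Assumed current behaviour: each call scans all context annotations for an equal span.
    static List<Span> findBySpanLinear(List<Span> contexts, int begin, int end) {
        List<Span> hits = new ArrayList<>();
        for (Span c : contexts) {
            if (c.begin == begin && c.end == end) {
                hits.add(c);
            }
        }
        return hits;
    }

    // Packs the two int offsets into one long so it can serve as a map key.
    static long key(int begin, int end) {
        return ((long) begin << 32) | (end & 0xFFFFFFFFL);
    }

    // Built once per document: span -> all context annotations with exactly that span.
    static Map<Long, List<Span>> indexBySpan(List<Span> contexts) {
        Map<Long, List<Span>> index = new HashMap<>();
        for (Span c : contexts) {
            index.computeIfAbsent(key(c.begin, c.end), k -> new ArrayList<>()).add(c);
        }
        return index;
    }

    // Same result as findBySpanLinear, but a single hash lookup per call.
    static List<Span> findBySpanIndexed(Map<Long, List<Span>> index, int begin, int end) {
        List<Span> hits = index.get(key(begin, end));
        return hits == null ? Collections.<Span>emptyList() : hits;
    }
}
```

Because the index is keyed only by offsets, it should still work when the context type differs from the window type, which is the general case the existing code seems to be protecting.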

So, my question is, does anyone know what the engineering goals of this 
setup are? I think it can be optimized even within the super-general 
framework it is trying to maintain, but I don't want to break anything 
by making assumptions that aren't valid.

Thanks
Tim

RE: getContextMap() question

Posted by "Masanz, James J." <Ma...@mayo.edu>.

I don't know the goal as such but I do know this bit of info:

You may already know this, but if it's the bit of code I think you are referring to, the window doesn't always have to be an org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation.
The windowAnnotations property within the LookupDesc*.xml files defines what type of annotation it will be.
I think at least one of the pipelines within cTAKES uses Sentence as the window. It appeared to me that at least one of the goals was to handle any type of annotation as the lookup window, and to avoid creating duplicate annotations if the lookup window type was one that contained overlapping or duplicated spans.
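
A rough illustration of the duplicate problem, under my own assumptions and not taken from the cTAKES source: Hit, WindowDedupSketch, lookup, and lookupAll are all hypothetical names, and the "dictionary" is a single term matched with indexOf. When windows overlap or repeat, the same hit is found more than once, so the sketch collects hits in a set keyed by offsets and concept code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class WindowDedupSketch {
    // Hypothetical stand-in for a dictionary hit: text offsets plus a concept code.
    static final class Hit {
        final int begin;
        final int end;
        final String code;

        Hit(int begin, int end, String code) {
            this.begin = begin;
            this.end = end;
            this.code = code;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Hit)) {
                return false;
            }
            Hit h = (Hit) o;
            return begin == h.begin && end == h.end && code.equals(h.code);
        }

        @Override
        public int hashCode() {
            return Objects.hash(begin, end, code);
        }
    }

    // Toy lookup: every occurrence of `term` that lies fully inside [winBegin, winEnd).
    static List<Hit> lookup(String text, int winBegin, int winEnd, String term, String code) {
        List<Hit> hits = new ArrayList<>();
        int from = winBegin;
        while (true) {
            int at = text.indexOf(term, from);
            if (at < 0 || at + term.length() > winEnd) {
                break;
            }
            hits.add(new Hit(at, at + term.length(), code));
            from = at + 1;
        }
        return hits;
    }

    // Runs the lookup over every window; the set drops hits already seen in another window.
    static Set<Hit> lookupAll(String text, int[][] windows, String term, String code) {
        Set<Hit> unique = new LinkedHashSet<>();
        for (int[] w : windows) {
            unique.addAll(lookup(text, w[0], w[1], term, code));
        }
        return unique;
    }
}
```

With windows that are all distinct and non-overlapping, the set does no useful work, which may be why the check looks like a no-op in the default pipeline.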

-- James
