Posted to dev@ctakes.apache.org by Alexandru Zbarcea <al...@apache.org> on 2017/10/01 23:54:23 UTC

Missing resources for script that extracts markables from a corpus for analysis

Hi,

I was trying to write a unit test for
org.apache.ctakes.coreference.data.PrintMimicMarkables (recently added),
but I couldn't find any existing resource that can be used for
this. Can anyone help by pointing me to a suitable resource (Lucene index) folder?

org.apache.ctakes.coreference.data.PrintMimicMarkables \
    /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index \
    index.out

I was trying with the following lucene folder/resource:
./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k

And also the dictionaries:
./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_codes_sample
./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_cue_phrase_index
./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_sample
./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index

Every execution looks like this:
01 Oct 2017 19:50:19  INFO ConstituencyParser - Initializing parser...
Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::) Message: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
    at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:152)
    at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:115)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
    at org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollectionReader.java:90)
    at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494)
    at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711)

Collection process complete called, closing file writer.
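
The message "docID must be >= 0 and < maxDoc=5000 (got docID=5000)" suggests that the reader asked for one document past the end of the index, since valid document IDs run from 0 to maxDoc - 1. A minimal, self-contained sketch of a bounded cursor follows. It does not use Lucene; the class name DocIdCursor and its methods are hypothetical and are not part of cTAKES or Lucene. Whether LuceneCollectionReader fails for exactly this reason is an assumption, not something confirmed against its code.

```java
import java.util.NoSuchElementException;

/** Hypothetical cursor over document IDs 0 .. maxDoc - 1 (not a cTAKES or Lucene class). */
public class DocIdCursor {
    private final int maxDoc; // number of documents in the index, e.g. 5000
    private int next = 0;     // next document ID to hand out

    public DocIdCursor(int maxDoc) {
        if (maxDoc < 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0");
        }
        this.maxDoc = maxDoc;
    }

    /** True while an unread ID remains; note the strict '<', not '<='. */
    public boolean hasNext() {
        return next < maxDoc;
    }

    /** Returns the next valid ID, or throws instead of stepping past the end. */
    public int next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no document with ID " + next);
        }
        return next++;
    }

    public static void main(String[] args) {
        DocIdCursor cursor = new DocIdCursor(5000);
        int last = -1;
        while (cursor.hasNext()) {
            last = cursor.next();
        }
        System.out.println("last docID read: " + last);
    }
}
```

With maxDoc = 5000 the loop stops after ID 4999, so a request for ID 5000 is never issued.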

I appreciate any help,
Alex

Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by Alexandru Zbarcea <al...@apache.org>.
Hi Tim,

Because LuceneIndex is touched in several places in the code, I started
by refactoring LuceneIndexReaderResourceImpl (see CTAKES-464 [1]).

If you have time, could you also check CTAKES-334 [2]? I have started treating it
as a prerequisite, because the patch provided there actually makes the tests
pass (provided UMLS credentials are also available).

Alex

[1] - https://issues.apache.org/jira/browse/CTAKES-464
[2] - https://issues.apache.org/jira/browse/CTAKES-334


Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by Alexandru Zbarcea <al...@apache.org>.
Thanks Tim,

I will let you know about the progress.

Alex


Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
I had in mind the notes in:
/ctakes-examples-res/src/main/resources/org/apache/ctakes/examples/notes/rtf

which I believe are the fake notes Dr. John Green wrote for us. I don't know why they are rtf but they are nice, non-toy-length notes.
Tim

________________________________________
From: Alexandru Zbarcea <al...@apache.org>
Sent: Tuesday, October 3, 2017 5:32 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Hi Tim,

That's great news. If you think there are sample notes that can be used, I
can start working on the Lucene index and gradually build the unit test for them.

I have created CTAKES-462[1] where we can track this work.

Looking into the ctakes-examples-res, what I can find is:
$ find . -type f | grep -v "\.class" | grep -v "\.iml" | grep -v "\.jar" |
grep -v "\.rtf" | grep -v "\.xml" | grep -v "\.bsv" | grep -v "\.piper"
./main/resources/org/apache/ctakes/examples/notes/pain_no_swelling.txt
./main/resources/org/apache/ctakes/examples/notes/claudication
./main/resources/org/apache/ctakes/examples/notes/shark_bite.txt
./main/resources/org/apache/ctakes/examples/notes/edge_cases_plaintext_1.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_1.txt
./main/resources/org/apache/ctakes/examples/notes/right_knee_arthroscopy
./main/resources/org/apache/ctakes/examples/notes/SampleInputRadiologyNotes.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_unknown.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/mother_goose/README
./main/resources/org/apache/ctakes/examples/notes/mother_goose/OneMistyMoistyMorning.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_2.txt
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_RoutBirthNote_1/Peds_RoutBirthNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_AAA_Leak_1/VascSurg_AAA_Leak_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_Dysphagia_1/Peds_Dysphagia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_LaborProgressNote_1/OBGYN_LaborProgressNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_IUD_1/OBGYN_IUD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_HysterectomyAndBSO_1/OBGYN_HysterectomyAndBSO_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_FollowUp_1/VascSurg_FollowUp_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_PROMCheck_1/OBGYN_PROMCheck_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_Gen_Abscess_1/OBGYN_Gen_Abscess_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_FebrileSez_1/Peds_FebrileSez_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_AAA_1/VascSurg_RO_AAA_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_DVT_1/VascSurg_RO_DVT_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/GenSurg_UmbilicalHernia_1/GenSurg_UmbilicalHernia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_PVD_1/VascSurg_PVD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_MVAPrego_1/OBGYN_MVAPrego_1

Which notes do you think I should start with (all of them)?

Alex

[1] - https://issues.apache.org/jira/browse/CTAKES-462


On Mon, Oct 2, 2017 at 6:46 PM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:

> Yeah, it might be nice to build a lucene index of all the sample notes in
> the ctakes-example module. I'll create a jira for it but probably won't be
> able to get to it right away.
> Tim
>
> ________________________________________
> From: Alexandru Zbarcea <al...@apache.org>
> Sent: Monday, October 2, 2017 5:31 PM
> To: Apache cTAKES Dev
> Subject: Re: Missing resources for script that extracts markables from a
> corpus for analysis [EXTERNAL]
>
> Hi Tim,
>
> I understand, makes sense. Is it possible to anonymize the data you have or
> come up with a separate body of test data to generate a Lucene index and
> unit test the code? I think this would have the double benefit of the code
> being tested and showing dev/users how the code is supposed to be used.
>
> What do you think?
>
> Alex
>
>
> On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:
>
> > Thanks Alex,
> > This code is for processing a clinical text data corpus stored as a
> > lucene index -- data that cannot be redistributed for privacy reasons.
> > Since it's so related to the coref stuff I thought it should go
> > alongside the coreference module. But maybe it makes more sense as an
> > external project since it can't really function without externally
> > created resources -- what do you think?
> > Tim

Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by Alexandru Zbarcea <al...@apache.org>.
Hi Tim,

That's great news. If you think there are sample notes that can be used, I
can start working on the Lucene index and gradually build the unit test for them.

I have created CTAKES-462[1] where we can track this work.

Looking into the ctakes-examples-res, what I can find is:
$ find . -type f | grep -v "\.class" | grep -v "\.iml" | grep -v "\.jar" |
grep -v "\.rtf" | grep -v "\.xml" | grep -v "\.bsv" | grep -v "\.piper"
./main/resources/org/apache/ctakes/examples/notes/pain_no_swelling.txt
./main/resources/org/apache/ctakes/examples/notes/claudication
./main/resources/org/apache/ctakes/examples/notes/shark_bite.txt
./main/resources/org/apache/ctakes/examples/notes/edge_cases_plaintext_1.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_1.txt
./main/resources/org/apache/ctakes/examples/notes/right_knee_arthroscopy
./main/resources/org/apache/ctakes/examples/notes/SampleInputRadiologyNotes.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_unknown.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/mother_goose/README
./main/resources/org/apache/ctakes/examples/notes/mother_goose/OneMistyMoistyMorning.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_2.txt
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_RoutBirthNote_1/Peds_RoutBirthNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_AAA_Leak_1/VascSurg_AAA_Leak_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_Dysphagia_1/Peds_Dysphagia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_LaborProgressNote_1/OBGYN_LaborProgressNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_IUD_1/OBGYN_IUD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_HysterectomyAndBSO_1/OBGYN_HysterectomyAndBSO_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_FollowUp_1/VascSurg_FollowUp_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_PROMCheck_1/OBGYN_PROMCheck_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_Gen_Abscess_1/OBGYN_Gen_Abscess_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_FebrileSez_1/Peds_FebrileSez_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_AAA_1/VascSurg_RO_AAA_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_DVT_1/VascSurg_RO_DVT_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/GenSurg_UmbilicalHernia_1/GenSurg_UmbilicalHernia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_PVD_1/VascSurg_PVD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_MVAPrego_1/OBGYN_MVAPrego_1
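
For a unit test that has to collect these files without shelling out, the same filter can be sketched in plain Java. This is only an illustration using the JDK: NoteFileFinder and findNotes are hypothetical names, not part of cTAKES, and the excluded substrings simply mirror the grep -v patterns in the command above (matched anywhere in the path, as grep does).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists candidate note files under a resource root.
final class NoteFileFinder {

    // Substrings excluded by the grep -v filters in the find command.
    private static final List<String> EXCLUDED = Arrays.asList(
            ".class", ".iml", ".jar", ".rtf", ".xml", ".bsv", ".piper");

    /** Returns the regular files under root whose path contains none of the excluded substrings. */
    static List<Path> findNotes(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(NoteFileFinder::isCandidate)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static boolean isCandidate(Path path) {
        String text = path.toString();
        return EXCLUDED.stream().noneMatch(text::contains);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory, like "find ." in the command.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path note : findNotes(root)) {
            System.out.println(note);
        }
    }
}
```

The resulting list could then feed whatever indexing step is chosen for the test fixture; that step is left out here because it depends on the Lucene version in use.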

Which notes do you think I should start with (all of them)?

Alex

[1] - https://issues.apache.org/jira/browse/CTAKES-462



Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Yeah, it might be nice to build a Lucene index of all the sample notes in the ctakes-examples module. I'll create a Jira issue for it but probably won't be able to get to it right away.
Tim


Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by Alexandru Zbarcea <al...@apache.org>.
Hi Tim,

I understand; that makes sense. Is it possible to anonymize the data you have or
come up with a separate body of test data to generate a Lucene index and
unit test the code? I think this would have the double benefit of testing the
code and showing developers and users how it is supposed to be used.

What do you think?

Alex



Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Thanks Alex,
This code is for processing a clinical text data corpus stored as a
Lucene index -- data that cannot be redistributed for privacy reasons.
Since it's so related to the coref stuff I thought it should go
alongside the coreference module. But maybe it makes more sense as an
external project since it can't really function without externally
created resources -- what do you think?
Tim

