Posted to dev@ctakes.apache.org by Tomasz Oliwa <ol...@uchicago.edu> on 2015/08/03 21:19:43 UTC

RE: UmlsConcept subject

Pei,

I added some example sentences to the JIRA ticket regarding the subject status.

It would be great if they could be used together with the existing sentences in a training set.

Regards,
Tomasz

________________________________________
From: Pei Chen [chenpei@apache.org]
Sent: Thursday, July 30, 2015 4:07 PM
To: dev@ctakes.apache.org
Subject: Re: UmlsConcept subject

Tomasz,
IIRC, the code in SubjectCleartkAnalysisEngine.java should show the
feature extractors used. I believe there is an enum of feature
presets, but I do not recall exactly which one performed best on the
test set, so it is probably best to check the source code.

I think adding the plain sentence examples in Jira would be a great
help, since we can use them for unit testing at a minimum.
Currently, there is no easy way to 'append' training data, so one
has to create a new set with the examples in it.  The code used for
training is also in the project; it should be in the **/eval/*
namespaces.  I believe the gold standard was created in XML (either
Knowtator or Anafora).

Hope that helps.
--Pei

On Thu, Jul 23, 2015 at 10:33 AM, Tomasz Oliwa <ol...@uchicago.edu> wrote:
> What format (features, labels) is most suitable for additional training examples?
>
> The SubjectCleartkAnalysisEngine class loads a /org/apache/ctakes/assertion/models/subject/model.jar, which contains a liblinear cleartk model.
>
> The model has 3 features, label 12 3.
>
> But what exactly are the features, and how are they derived?
>
> What does the target class look like? Is it really differentiating between "patient", "brother", "sister" etc., or is it a binary decision model between "patient" and "family_history" (the latter is what it looks like to me)?
>
> This is not documented.
>
> Tomasz
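
Note on inspecting the model jar mentioned above: a jar is an ordinary zip archive, so one way to see what model.jar contains is to list its entries. The following is only a minimal sketch in Python, assuming the jar has already been copied out of the classpath to a local file. The function name list_model_entries is hypothetical and is not part of cTAKES or ClearTK, and the sketch makes no claim about which files the archive holds.

```python
import sys
import zipfile


def list_model_entries(jar_path):
    """Return the sorted entry names of a jar file (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(jar.namelist())


if __name__ == "__main__":
    # Usage (path is whatever local copy of the jar you have):
    #   python list_jar.py path/to/model.jar
    for name in list_model_entries(sys.argv[1]):
        print(name)
```

Listing the entries only shows how the model is packaged. The feature definitions themselves would still have to be read from the SubjectCleartkAnalysisEngine source, as suggested earlier in the thread.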