Posted to dev@tika.apache.org by "Giuseppe Totaro (JIRA)" <ji...@apache.org> on 2015/05/29 02:32:32 UTC

[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika

    [ https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563993#comment-14563993 ] 

Giuseppe Totaro commented on TIKA-1642:
---------------------------------------

Hi [~selina], I think that is a great idea. I will update my code on GitHub right away to add support for cTAKES metadata, as you suggested.
Then I will post a new patch for Tika here.
Thanks a lot,
Giuseppe

> Integrate cTAKES into Tika
> --------------------------
>
>                 Key: TIKA-1642
>                 URL: https://issues.apache.org/jira/browse/TIKA-1642
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Selina Chu
>
> [~gostep] has written a preliminary version of [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika.
> The CTAKESContentHandler allows the following steps to be performed in Tika:
> * create an AnalysisEngine based on a given XML descriptor;
> * create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
> * populate the CAS with the text extracted by using Tika;
> * run the AnalysisEngine against the plain text added to the CAS;
> * write out the results in the given format (XML, XCAS, XMI, etc.).
> It would be a great improvement if we could parse the output of cTAKES and create a list of metadata describing the terms found in the annotation index and their corresponding tokens. For instance, using the AggregatePlaintextFastUMLSProcessor analysis engine, we can use the UMLS database to obtain the annotations related to DiseaseDisorderMention, and I would like to be able to produce a list of the words in the input text that are annotated as DiseaseDisorderMention.
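
The metadata idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: annotations are taken to be available as a type name plus begin/end character offsets into the extracted text, and plain Java collections stand in for the UIMA annotation index. The names AnnotationMetadata, Span, and collect are hypothetical and are not part of Tika, cTAKES, or the CTAKESContentHandler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AnnotationMetadata {

    // Hypothetical stand-in for a cTAKES/UIMA annotation: a type name plus
    // character offsets into the document text. Not a real cTAKES class.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Groups the text covered by each span under its annotation type,
    // keeping types and covered words in the order they are encountered.
    static Map<String, List<String>> collect(String text, List<Span> spans) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (Span s : spans) {
            // Ignore spans whose offsets do not fit the text.
            if (s.begin < 0 || s.end > text.length() || s.begin >= s.end) {
                continue;
            }
            metadata.computeIfAbsent(s.type, k -> new ArrayList<>())
                    .add(text.substring(s.begin, s.end));
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Made-up sample text and offsets, for illustration only.
        String text = "Patient has diabetes and asthma.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("DiseaseDisorderMention", 12, 20));
        spans.add(new Span("DiseaseDisorderMention", 25, 31));
        System.out.println(collect(text, spans));
    }
}
```

Each key of the resulting map could then be written as a multi-valued metadata entry, so that a type such as DiseaseDisorderMention carries the list of words it covers in the input text.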


