Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/09/23 01:33:16 UTC

[Tika Wiki] Update of "cTAKESParser" by ChrisMattmann

The "cTAKESParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/cTAKESParser?action=diff&rev1=7&rev2=8

Comment:
- update instructions

  
  = Prepare your CTAKES configuration properties file =
  
- The cTAKESParser requires a configuration properties file. You can find an example [[https://issues.apache.org/jira/secure/attachment/12737116/CTAKESConfig.properties|here]] on [[https://issues.apache.org/jira/browse/TIKA-1645|TIKA-1645]].
+ The cTAKESParser requires a configuration properties file. You can find an example [[https://raw.githubusercontent.com/chrismattmann/ctakesparser-utils/master/config/org/apache/tika/parser/ctakes/CTAKESConfig.properties|here]]; it originated on [[https://issues.apache.org/jira/browse/TIKA-1645|TIKA-1645]] and is now adapted and maintained on GitHub in [[https://github.com/chrismattmann/ctakesparser-utils/|ctakesparser-utils]].
  
  Edit it as follows.
  
@@ -58, +58 @@

  You will need to place the CTAKESConfig.properties file in a classpath directory, e.g., org/apache/tika/parser/ctakes, and include that directory tree on the classpath when calling the parser. Follow these steps:
  
   1. `mkdir -p $HOME/src/ctakes-config/org/apache/tika/parser/ctakes && cd $HOME/src/ctakes-config/org/apache/tika/parser/ctakes`
-  2. `curl -kO "https://issues.apache.org/jira/secure/attachment/12737116/CTAKESConfig.properties"`
+  2. `curl -kO "https://raw.githubusercontent.com/chrismattmann/ctakesparser-utils/master/config/org/apache/tika/parser/ctakes/CTAKESConfig.properties"`
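
The directory layout in the steps above matters because the file is looked up relative to a classpath root. The following is a minimal Python sketch of that lookup, not Tika code: it assumes the parser resolves the file as the package-relative resource `org/apache/tika/parser/ctakes/CTAKESConfig.properties`, and the helper name `find_on_classpath` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

# Package-relative resource path, taken from the directory layout on this page.
RESOURCE = "org/apache/tika/parser/ctakes/CTAKESConfig.properties"


def find_on_classpath(roots: Iterable[Path], resource: str = RESOURCE) -> Optional[Path]:
    """Return the first root/resource path that exists as a file, else None.

    This imitates how a directory entry on a Java classpath is used:
    the entry is treated as a root and the resource path is appended to it.
    """
    for root in roots:
        candidate = Path(root) / resource
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: the file was downloaded under $HOME/src/ctakes-config as in the steps.
    config_root = Path.home() / "src" / "ctakes-config"
    hit = find_on_classpath([config_root])
    print(hit if hit else f"{RESOURCE} not found under {config_root}")
```

Under that assumption, the classpath entry to use is `$HOME/src/ctakes-config` itself, not the leaf `ctakes` directory that contains the file.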
  
  = Setting up the Tika Config file =
  
- You will need a custom Tika configuration file for the parser. You can find one [[https://issues.apache.org/jira/secure/attachment/12737115/tika-config.xml|here]]. Because cTAKESParser decorates AutoDetectParser, it can in principle handle *any* file type that AutoDetectParser can. However, you must tell cTAKESParser which mime types to intercept for biomedical information extraction. So if you want Tika and its cTAKESParser to extract biomedical information from application/pdf files, you need this custom config with application/pdf added as a mime type that the parser handles. The default config provided looks like:
+ You will need a custom Tika configuration file for the parser. You can find one [[https://raw.githubusercontent.com/chrismattmann/ctakesparser-utils/master/config/tika-config.xml|here]]. Because cTAKESParser decorates AutoDetectParser, it can in principle handle *any* file type that AutoDetectParser can. However, you must tell cTAKESParser which mime types to intercept for biomedical information extraction. So if you want Tika and its cTAKESParser to extract biomedical information from application/pdf files, you need this custom config with application/pdf added as a mime type that the parser handles. The default config provided looks like:
  
  {{{
  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
@@ -70, +70 @@

    <parsers>
      <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
        <mime>application/x-isatab</mime>
+       <parser class="org.apache.tika.parser.DefaultParser"/>
      </parser>
    </parsers>
  </properties>
@@ -84, +85 @@

      <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
        <mime>application/x-isatab</mime>
        <mime>application/pdf</mime>
+       <parser class="org.apache.tika.parser.DefaultParser"/>
      </parser>
    </parsers>
  </properties>
@@ -94, +96 @@

  To download and set up the custom Tika config, do the following.
  
   1. `cd $HOME/src/ctakes-config`
-  2. `curl -kO "https://issues.apache.org/jira/secure/attachment/12737115/tika-config.xml"`
+  2. `curl -kO "https://raw.githubusercontent.com/chrismattmann/ctakesparser-utils/master/config/tika-config.xml"`
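
After downloading or editing the config, it can help to confirm which mime types are actually routed to cTAKESParser. The following is a minimal sketch using only the Python standard library. It assumes the config has the `<properties>/<parsers>/<parser>` shape shown earlier on this page. The helper name `ctakes_mime_types` and the inline `SAMPLE` string are illustrative and not part of Tika.

```python
import xml.etree.ElementTree as ET
from typing import List

CTAKES_CLASS = "org.apache.tika.parser.ctakes.CTAKESParser"

# Illustrative config mirroring the application/pdf example on this page.
SAMPLE = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <mime>application/x-isatab</mime>
      <mime>application/pdf</mime>
      <parser class="org.apache.tika.parser.DefaultParser"/>
    </parser>
  </parsers>
</properties>"""


def ctakes_mime_types(config_xml: str) -> List[str]:
    """List the <mime> entries declared directly under the CTAKESParser element."""
    root = ET.fromstring(config_xml)
    mimes: List[str] = []
    for parser in root.iter("parser"):
        # Nested parsers such as DefaultParser are skipped by the class check.
        if parser.get("class") == CTAKES_CLASS:
            mimes.extend((m.text or "").strip() for m in parser.findall("mime"))
    return mimes


if __name__ == "__main__":
    print(ctakes_mime_types(SAMPLE))
```

To check the downloaded file rather than the inline sample, read `$HOME/src/ctakes-config/tika-config.xml` into a string and pass it to the same function.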
  
  = Putting it all together: Tika-App =