Posted to dev@tika.apache.org by "Alan Simmons (JIRA)" <ji...@apache.org> on 2016/12/07 21:02:58 UTC

[jira] [Updated] (TIKA-2188) Illegal SAXException when using cTAKESParser (Docker configuration)

     [ https://issues.apache.org/jira/browse/TIKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Simmons updated TIKA-2188:
-------------------------------
    Summary: Illegal SAXException when using cTAKESParser (Docker configuration)  (was: Illegal SAXException when using cTAKESParser)

> Illegal SAXException when using cTAKESParser (Docker configuration)
> -------------------------------------------------------------------
>
>                 Key: TIKA-2188
>                 URL: https://issues.apache.org/jira/browse/TIKA-2188
>             Project: Tika
>          Issue Type: Bug
>          Components: cli, parser
>    Affects Versions: 1.13
>         Environment: Ubuntu 14.04.5 LTS
>            Reporter: Alan Simmons
>              Labels: newbie
>
> Contents:
> 1. Description of problem
> 2. My tika-config.xml file
> 3. My CTAKESConfig.properties file
> 4. Error stack of problem
> DESCRIPTION OF PROBLEM:
> I am trying to configure Tika to use cTAKES as a parser, per instructions in https://wiki.apache.org/tika/cTAKESParser.
> I am working on a Mac running Sierra (OSX 10.12.1). 
> I was able to configure Tika 1.13 to run with cTAKES 3.2.2 as a parser in my OSX environment. In particular, I was able to run both the standalone app and server against the sample file (Vose...pdf) mentioned in the Wiki.
> I then tried to configure Tika 1.15 (the version from the GitHub repo) in a Docker container. The container OS is Ubuntu 14.04.5.
> I tried to run the Tika standalone app jar against the Vose PDF. It failed with the stack trace that I include at the bottom of this message.
> I then tried to run the 1.13 Tika app in the Docker container, with the same result.
> In the Docker container:
> 1. I am able to run the Tika 1.15 app with the Default parser (i.e., without referring to the custom configuration XML for cTAKES).
> 2. I am able to run the Tika 1.15 app if the configuration file uses the Default parser before the org.apache.tika.parser.ctakes.CTAKESParser.
> 3. I am able to run cTAKES directly from the CLI against the Vose PDF, so I know that cTAKES can parse the file.
> 4. I ran pdfbox-app-2.0.3.jar against the file with no errors.
> ---------------------
> MY tika-config.xml file:
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
>       <mime>application/x-isatab</mime>
>       <mime>application/pdf</mime>
>       <mime>text/plain</mime>
>     </parser>
>   </parsers>
> </properties>
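
To double-check which MIME types a configuration like the one above hands to which parser class, the file can be read with the JDK's own DOM parser. This is only a sketch: `TikaConfigInspector` and `parserMimes` are hypothetical names, and the code does not use Tika's own configuration loader, so it says nothing about how Tika itself interprets the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
final class TikaConfigInspector {

    /** Maps each <parser class="..."> to the <mime> values found beneath it. */
    static Map<String, List<String>> parserMimes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                List<String> mimes = new ArrayList<>();
                // Note: this also collects <mime> entries of nested <parser> elements.
                NodeList mimeNodes = parser.getElementsByTagName("mime");
                for (int j = 0; j < mimeNodes.getLength(); j++) {
                    mimes.add(mimeNodes.item(j).getTextContent().trim());
                }
                result.put(parser.getAttribute("class"), mimes);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.ctakes.CTAKESParser\">"
                + "<mime>application/x-isatab</mime>"
                + "<mime>application/pdf</mime>"
                + "<mime>text/plain</mime>"
                + "</parser></parsers></properties>";
        parserMimes(xml).forEach((cls, mimes) -> System.out.println(cls + " -> " + mimes));
    }
}
```

For the configuration above, the only parser entry is CTAKESParser, which fits the observation that adding the Default parser ahead of it changes the outcome.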
> -----
> MY CTAKESConfig.properties file:
> aeDescriptorPath=/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
> text=true
> annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR
> separatorChar=:
> metadata=Problem,oncologic history,medical history,Study Title, Study Description
> UMLSUser=<my UMLS user name>
> UMLSPass=<my UMLS password>
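
Because the same properties file behaves differently on OSX and inside the container, a quick sanity check of its values may help narrow things down. The sketch below is an assumption-laden aid, not Tika code: `CtakesConfigCheck` and `check` are hypothetical names, and since the report does not say whether `aeDescriptorPath` is resolved against the classpath or the filesystem, the caller supplies candidate root directories (for example the directory entries of the classpath). It also flags `metadata` entries with surrounding whitespace, without claiming that Tika treats them differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Tika or cTAKES.
final class CtakesConfigCheck {

    /** Returns human-readable findings for the given properties text. */
    static List<String> check(String propertiesText, List<Path> roots) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("Could not read properties", e);
        }
        List<String> findings = new ArrayList<>();

        String descriptor = props.getProperty("aeDescriptorPath");
        if (descriptor == null) {
            findings.add("aeDescriptorPath is not set");
        } else {
            // Treat the value as relative to each candidate root (an assumption).
            String relative = descriptor.replaceFirst("^/+", "");
            boolean found = false;
            for (Path root : roots) {
                if (Files.isRegularFile(root.resolve(relative))) {
                    found = true;
                }
            }
            if (!found) {
                findings.add("aeDescriptorPath not found under " + roots + ": " + descriptor);
            }
        }

        // Show entries in brackets so stray spaces are visible.
        for (String name : props.getProperty("metadata", "").split(",")) {
            if (!name.equals(name.trim())) {
                findings.add("metadata entry has surrounding whitespace: [" + name + "]");
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        String text = "aeDescriptorPath=/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml\n"
                + "metadata=Problem,oncologic history,medical history,Study Title, Study Description\n";
        List<Path> roots = new ArrayList<>();
        for (String arg : args) {
            roots.add(Paths.get(arg)); // candidate roots passed on the command line
        }
        check(text, roots).forEach(System.out::println);
    }
}
```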
> ---
> ERROR STACK TRACE
> NOTE: By comparing the info messages produced in different scenarios (successful Tika+cTAKES, cTAKES CPE, and unsuccessful Tika+cTAKES), it looks like Tika is loading the cTAKES parser but running into a problem right after POS tagging.
> java -Xms256m -Xmx1024m -classpath $HOME/src/ctakes-config:/tika/tika-app/target/tika-app-1.15-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/\* org.apache.tika.cli.TikaCLI --config=$HOME/src/ctakes-config/tika-config.xml -m Vose.pdf
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/tika/tika-app/target/tika-app-1.15-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/ctakes_files/apache-ctakes-3.2.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> log4j: reset attribute= "false".
> log4j: Threshold ="null".
> log4j: Retreiving an instance of org.apache.log4j.Logger.
> log4j: Setting [ProgressAppender] additivity to [false].
> log4j: Level value for ProgressAppender is  [INFO].
> log4j: ProgressAppender level set to INFO
> log4j: Class name: [org.apache.log4j.ConsoleAppender]
> log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
> log4j: Setting property [conversionPattern] to [%m].
> log4j: Adding appender named [noEolAppender] to category [ProgressAppender].
> log4j: Retreiving an instance of org.apache.log4j.Logger.
> log4j: Setting [ProgressDone] additivity to [false].
> log4j: Level value for ProgressDone is  [INFO].
> log4j: ProgressDone level set to INFO
> log4j: Class name: [org.apache.log4j.ConsoleAppender]
> log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
> log4j: Setting property [conversionPattern] to [%m%n].
> log4j: Adding appender named [eolAppender] to category [ProgressDone].
> log4j: Level value for root is  [INFO].
> log4j: root level set to INFO
> log4j: Class name: [org.apache.log4j.ConsoleAppender]
> log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
> log4j: Setting property [conversionPattern] to [%d{dd MMM yyyy HH:mm:ss} %5p %c{1} - %m%n].
> log4j: Adding appender named [consoleAppender] to category [root].
> 01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - Loading NLM Norm and Lvg with config file = /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
> 01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl -   config file absolute path = /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
> 01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cwd = /
> 01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cd /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/
> 01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cd /
> 01 Dec 2016 20:38:45  INFO ClearNLPDependencyParserAE - using Morphy analysis? true
> Loading configuration.
> Loading feature templates.
> Loading lexica.
> Loading model:
> ........................................................................................
> 01 Dec 2016 20:39:01  INFO Chunker - Chunker model file: org/apache/ctakes/chunker/models/chunker-model.zip
> 01 Dec 2016 20:39:02  INFO ContextDependentTokenizerAnnotator - Finite state machines loaded.
> 01 Dec 2016 20:39:02  INFO ConstituencyParser - Initializing parser...
> 01 Dec 2016 20:39:07  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
> 01 Dec 2016 20:39:07  INFO NegationContextAnalyzer - initBoundaryData() called for ContextInitializer
> 01 Dec 2016 20:39:08  INFO POSTagger - POS tagger model file: org/apache/ctakes/postagger/models/mayo-pos.zip
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.ParserDecorator$1@5fe1ce85
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:290)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: org.xml.sax.SAXException
> 	at org.apache.tika.parser.ctakes.CTAKESContentHandler.endDocument(CTAKESContentHandler.java:162)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
> 	at org.apache.tika.sax.SafeContentHandler.endDocument(SafeContentHandler.java:281)
> 	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:230)
> 	at org.apache.tika.parser.EmptyParser.parse(EmptyParser.java:55)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> 	at org.apache.tika.parser.ctakes.CTAKESParser.parse(CTAKESParser.java:85)
> 	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	... 5 more
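
The trace above ends at an `org.xml.sax.SAXException` with no message and no further "Caused by" entry, so whatever went wrong inside `CTAKESContentHandler.endDocument` is not visible here. When reproducing the failure from code instead of the CLI, walking the cause chain explicitly can confirm whether anything is nested. A minimal sketch using only JDK classes follows; `CauseChain` and `describe` are hypothetical names, and the exceptions in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika.
final class CauseChain {

    /** Lists "ClassName: message" for the throwable and each nested cause. */
    static List<String> describe(Throwable t) {
        List<String> links = new ArrayList<>();
        // Cap the depth as a guard against cyclic cause chains.
        for (int depth = 0; t != null && depth < 32; depth++) {
            links.add(t.getClass().getName() + ": " + t.getMessage());
            t = t.getCause();
        }
        return links;
    }

    public static void main(String[] args) {
        Exception root = new IllegalStateException("hypothetical root failure");
        // A SAXException built around another exception keeps it reachable.
        Exception kept = new RuntimeException("outer", new SAXException(root));
        // A bare SAXException, like the one in the trace, carries nothing further.
        Exception lost = new RuntimeException("outer", new SAXException());
        describe(kept).forEach(System.out::println);
        System.out.println("---");
        describe(lost).forEach(System.out::println);
    }
}
```

If the chain stops at the SAXException, as it appears to in the trace, the underlying error was probably not propagated, and the cTAKES log output just before the exception is the better place to look.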



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)