Posted to user@tika.apache.org by "Hanjan, Harinder" <Ha...@calgary.ca> on 2018/05/02 19:04:20 UTC

Tika Server 1.18 sees PDF as a plain text file

Hello!

I am sending a PDF document to Tika Server and it is being detected as a plain text file (see full stack trace at bottom). If I specify 'Content-Type: application/pdf' in the header of the request, then Tika is able to extract content. In the tests below, mydocument.pdf is simply a text file I printed to PDF using Google Chrome.

Am I wrong to expect Tika to determine the document type without any additional help?

Sent:
  curl -X PUT http://localhost:9998/tika --data-binary "@mydocument.pdf"
  curl -X PUT http://localhost:9998/tika -F "data=@mydocument.pdf"
Received:
  HTTP 415 Unsupported Media Type exception

Sent:
  curl -X PUT http://localhost:9998/tika --data-binary "@mydocument.pdf" -H "Content-Type: application/pdf"
  curl -X PUT http://localhost:9998/meta -F "data=@mydocument.pdf" -H "Content-Type: application/pdf"
Received:
  Text for the PDF


INFO  tika (application/x-www-form-urlencoded)
WARN  tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@1469bc28
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:390)
        at org.apache.tika.server.resource.TikaResource$5.write(TikaResource.java:489)
        at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1414)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:243)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:119)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:82)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
        at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:125)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 32 more
ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$5, ContentType: text/plain


Thanks!
Harinder

________________________________
NOTICE -
This communication is intended ONLY for the use of the person or entity named above and may contain information that is confidential or legally privileged. If you are not the intended recipient named above or a person responsible for delivering messages or communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying of this communication or any of the information contained in it is strictly prohibited. If you have received this communication in error, please notify us immediately by telephone and then destroy or delete this communication, or return it to us by mail if requested by us. The City of Calgary thanks you for your attention and co-operation.

Re: Tika Server 1.18 sees PDF as a plain text file

Posted by Tim Allison <ta...@apache.org>.
I've had better luck with -T

curl -T test_recursive_embedded.docx http://localhost:9998/meta

https://wiki.apache.org/tika/TikaJAXRS
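
A possible explanation for why -T behaves differently, offered as a sketch rather than a confirmed diagnosis: when no Content-Type is supplied, curl's --data-binary (like -d) adds "Content-Type: application/x-www-form-urlencoded" on its own, which matches the "INFO  tika (application/x-www-form-urlencoded)" line in the log above. With -T, curl sends the file as the raw PUT body and adds no such default, so the server appears to be left free to detect the type itself. The helper names (tika_put_file, tika_put_data) and the TIKA_URL variable below are illustrative assumptions, not part of Tika or curl; the port and the /tika endpoint are taken from the original post.

```shell
# Assumption: Tika Server is reachable on localhost:9998, as in the original post.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Upload with -T: curl sends the file as the raw PUT body and adds no
# default Content-Type header, leaving type detection to the server.
tika_put_file() {
  curl -sS -T "$1" "$TIKA_URL/tika"
}

# Upload with --data-binary: curl would normally add
# "Content-Type: application/x-www-form-urlencoded". A header name with
# nothing after the colon asks curl to drop that header from the request.
tika_put_data() {
  curl -sS -X PUT --data-binary "@$1" -H "Content-Type:" "$TIKA_URL/tika"
}
```

With a server running, either helper can be called as "tika_put_file mydocument.pdf". Adding -v to the curl invocation shows the request headers actually sent, which is a quick way to check whether a Content-Type is going out.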
