Posted to dev@tika.apache.org by Pei Chen <ch...@apache.org> on 2016/02/24 23:15:25 UTC

PDFParser in-process mode

Hi tika-dev,
Does the default PDF parser, when used through the auto-detect parser,
require Tika to run in server mode?  It seems to try to open an HTTP
connection to localhost:8080 by default.  Can it run in-process?


...<snip>
FileInputStream stream = new FileInputStream("src/test/resources/somepdf.pdf");
//works fine in-process with other doc types.
Tika tika = new Tika();
tika.parseToString(stream);
...<snip>


24 Feb 2016 17:06:24  WARN PhaseInterceptorChain - Interceptor for
{http://localhost:8080/processHeaderDocument}WebClient has thrown
exception, unwinding now

org.apache.cxf.interceptor.Fault: No message body writer has been
found for class org.apache.cxf.jaxrs.ext.multipart.MultipartBody,
ContentType: multipart/form-data

at org.apache.cxf.jaxrs.client.WebClient$BodyWriter.doWriteBody(WebClient.java:1220)
at org.apache.cxf.jaxrs.client.AbstractClient$AbstractBodyWriter.handleMessage(AbstractClient.java:1044)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.jaxrs.client.AbstractClient.doRunInterceptorChain(AbstractClient.java:623)
at org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1084)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:883)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:854)
at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:320)
at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:329)
at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:74)
at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:496)
at org.apache.tika.Tika.parseToString(Tika.java:571)


Re: PDFParser in-process mode

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 24 Feb 2016, Pei Chen wrote:
> Does the default pdf parser using auto detect parser require to tika
> to run in server mode?

No

> It seems to try and open an http connection to localhost:8080 by 
> default?  Can it run in-process?

The stack trace shows you're not using the PDF parser:

> at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:74)
> at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)

See https://wiki.apache.org/tika/GrobidJournalParser for how to configure
the Grobid parser if you want to use it.

Nick
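
Nick's reading of the trace can be sketched in code. The snippet below is only
an illustration, assuming the usual "at package.Class.method(File.java:N)"
frame format; ParserFrameFinder and its methods are hypothetical names, not
part of Tika. It walks the trace from the outermost frame inwards and reports
the first class under org.apache.tika.parser that is not one of the
dispatching parsers (CompositeParser, AutoDetectParser), i.e. the parser the
document was actually routed to.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; it is not part of Apache Tika.
class ParserFrameFinder {

    // Classes that route a document to another parser instead of parsing it themselves.
    private static final List<String> DISPATCHERS = Arrays.asList(
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.AutoDetectParser");

    private static final String PARSER_PACKAGE = "org.apache.tika.parser.";

    // Scans from the last line (outermost call) to the first (innermost call) and
    // returns the first Tika parser class that is not a dispatcher.
    static Optional<String> dispatchedParser(List<String> traceLines) {
        for (int i = traceLines.size() - 1; i >= 0; i--) {
            String className = classNameOf(traceLines.get(i));
            if (className != null
                    && className.startsWith(PARSER_PACKAGE)
                    && !DISPATCHERS.contains(className)) {
                return Optional.of(className);
            }
        }
        return Optional.empty();
    }

    // Extracts "pkg.Class" from a line such as "at pkg.Class.method(File.java:1)".
    // Returns null for lines that are not stack frames.
    static String classNameOf(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("at ")) {
            return null;
        }
        int paren = trimmed.indexOf('(');
        if (paren < 0) {
            return null;
        }
        String qualifiedMethod = trimmed.substring(3, paren).trim();
        int lastDot = qualifiedMethod.lastIndexOf('.');
        return lastDot < 0 ? null : qualifiedMethod.substring(0, lastDot);
    }

    public static void main(String[] args) {
        // Frames copied from the trace in this thread.
        List<String> trace = Arrays.asList(
                "at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:329)",
                "at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:74)",
                "at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)",
                "at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)",
                "at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)",
                "at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)",
                "at org.apache.tika.Tika.parseToString(Tika.java:496)");
        System.out.println(dispatchedParser(trace).orElse("no concrete parser frame found"));
    }
}
```

Run against the frames from this thread, the helper reports
org.apache.tika.parser.journal.JournalParser, which matches the point above:
the HTTP call comes from the journal parser delegating to GrobidRESTParser,
not from the PDF parser.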

Re: PDFParser in-process mode

Posted by Pei Chen <ch...@apache.org>.
Thanks, Nick.
That was just a copy-and-paste error in the email.
I was able to figure out how to bypass the JournalParser and use only the PDF parser.
--Pei
