Posted to solr-user@lucene.apache.org by Peter Bleackley <bl...@zooey.co.uk> on 2013/10/11 16:57:51 UTC

Problems using DataImportHandler and TikaEntityProcessor

Starting Solr with the command line


java -Dsolr.solr.home=example-DIH/solr -jar start.jar


and then trying to import some data with

java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf -jar post.jar *.pdf

fails with error

SimplePostTool: WARNING: Solr returned an error #400 Bad Request
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/tika/update

These are all valid PDFs that I have previously been able to import with 
Solr Cell.

What am I doing wrong?

Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd



Re: Problems using DataImportHandler and TikaEntityProcessor

Posted by PeteBleackley <bl...@zooey.co.uk>.
OK, so I put my PDF files in a directory /path/to/pdf, and edited
example-DIH/solr/tika/conf/tika-data-config.xml to contain the entity
<entity name="tika-test" processor="TikaEntityProcessor" url="/path/to/pdf" format="xml">

What should I do next?


Shawn Heisey-4 wrote
> On 10/11/2013 9:32 AM, PeteBleackley wrote:
>> I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
>> error, apparently caused by post.jar adding /extract to the end of the URL
> 
> In order to use post.jar, you would need the /update/extract handler,
> which is not defined in the tika core under example-DIH.
> 
> The example-DIH configurations are intended to use and illustrate the
> dataimport handler - documents are imported using the /dataimport
> handler and its config file, not sent directly with post.jar.
> 
> Here's a page covering what you would need in order to send PDFs
> directly rather than import them using DIH:
> 
> http://wiki.apache.org/solr/ExtractingRequestHandler
> 
> Thanks,
> Shawn





--
View this message in context: http://lucene.472066.n3.nabble.com/Problems-using-DataImportHandler-and-TikaEntityProcessor-tp4094983p4095366.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Problems using DataImportHandler and TikaEntityProcessor

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/11/2013 9:32 AM, PeteBleackley wrote:
> I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
> error, apparently caused by post.jar adding /extract to the end of the URL

In order to use post.jar, you would need the /update/extract handler,
which is not defined in the tika core under example-DIH.

The example-DIH configurations are intended to use and illustrate the
dataimport handler - documents are imported using the /dataimport
handler and its config file, not sent directly with post.jar.
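
As a minimal sketch (not part of the original message): a DIH import is started by sending an HTTP request to the core's /dataimport handler with a command parameter such as full-import. The Python helper names below are hypothetical; the base URL and the core name "tika" are taken from the commands earlier in this thread, and the sketch assumes the handler is registered at /dataimport as described above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, core, command="full-import", **params):
    """Build the URL for a DataImportHandler command on one core.

    base_url and core are assumptions taken from this thread, e.g.
    "http://localhost:8983/solr" and "tika".
    """
    query = urlencode({"command": command, **params})
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


def run_dataimport(base_url, core, command="full-import", **params):
    """Send the command to a running Solr and return the response body as text."""
    with urlopen(dataimport_url(base_url, core, command, **params)) as resp:
        return resp.read().decode("utf-8")
```

With Solr started from example-DIH, calling run_dataimport("http://localhost:8983/solr", "tika") would request a full import, and passing command="status" would ask the handler how the import is going. Which files get imported is decided by tika-data-config.xml, not by the request.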

Here's a page covering what you would need in order to send PDFs
directly rather than import them using DIH:

http://wiki.apache.org/solr/ExtractingRequestHandler

Thanks,
Shawn


Re: Problems using DataImportHandler and TikaEntityProcessor

Posted by Furkan KAMACI <fu...@gmail.com>.
Here is a similar conversation:
http://search-lucene.com/m/GeXcg1YfgQ32/Re%253A+Solr+4.0+error+message%253A+%2522Unsupported+ContentType%253A+Content-type%253Atext%252Fxml%2522&subj=Re+Solr+4+0+error+message+Unsupported+ContentType+Content+type+text+xml+

Could you change -Dauto into -Dtype=application/pdf and try it again?


2013/10/11 PeteBleackley <bl...@zooey.co.uk>

> kamaci wrote
> > There may be a problem with your schema. Could you send your Solr logs?
> >
> >
> > 2013/10/11 Peter Bleackley <bleackleyp@.co>
> >
> >> Starting Solr with the command line
> >>
> >>
> >> java -Dsolr.solr.home=example-DIH/solr -jar start.jar
> >>
> >>
> >> and then trying to import some data with
> >>
> >> java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf -jar post.jar *.pdf
> >>
> >> fails with error
> >>
> >> SimplePostTool: WARNING: Solr returned an error #400 Bad Request
> >> SimplePostTool: WARNING: IOException while reading response:
> >> java.io.IOException: Server returned HTTP response code: 400 for URL:
> >> http://localhost:8983/solr/tika/update
> >>
> >> These are all valid PDFs that I have previously been able to import with
> >> Solr Cell.
> >>
> >> What am I doing wrong?
> >>
> >> Dr Peter J Bleackley
> >> Computational Linguistics Contractor
> >> Playful Technology Ltd
> >>
> >>
> >>
>
> 11228 [qtp1831924725-17] INFO org.apache.solr.update.processor.LogUpdateProcessor  – [tika] webapp=/solr path=/update params={} {} 0 0
> 11229 [qtp1831924725-17] ERROR org.apache.solr.core.SolrCore  – org.apache.solr.common.SolrException: Unsupported ContentType: application/pdf  Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]
>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:86)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:368)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>         at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Thread.java:724)
>
>
> I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
> error, apparently caused by post.jar adding /extract to the end of the URL

Re: Problems using DataImportHandler and TikaEntityProcessor

Posted by PeteBleackley <bl...@zooey.co.uk>.
kamaci wrote
> There may be a problem with your schema. Could you send your Solr logs?
> 
> 
> 2013/10/11 Peter Bleackley <bleackleyp@.co>
> 
>> Starting Solr with the command line
>>
>>
>> java -Dsolr.solr.home=example-DIH/solr -jar start.jar
>>
>>
>> and then trying to import some data with
>>
>> java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf -jar post.jar *.pdf
>>
>> fails with error
>>
>> SimplePostTool: WARNING: Solr returned an error #400 Bad Request
>> SimplePostTool: WARNING: IOException while reading response:
>> java.io.IOException: Server returned HTTP response code: 400 for URL:
>> http://localhost:8983/solr/tika/update
>>
>> These are all valid PDFs that I have previously been able to import with
>> Solr Cell.
>>
>> What am I doing wrong?
>>
>> Dr Peter J Bleackley
>> Computational Linguistics Contractor
>> Playful Technology Ltd
>>
>>
>>

11228 [qtp1831924725-17] INFO org.apache.solr.update.processor.LogUpdateProcessor  – [tika] webapp=/solr path=/update params={} {} 0 0
11229 [qtp1831924725-17] ERROR org.apache.solr.core.SolrCore  – org.apache.solr.common.SolrException: Unsupported ContentType: application/pdf  Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:86)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:724)


I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
error, apparently caused by post.jar adding /extract to the end of the URL
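
To make the two failures concrete, here is a small illustrative sketch (not Solr's actual code). The first function mirrors the "Not in: [...]" list from the log above, which is why a PDF body sent to /update is rejected with a 400. The second reproduces the URL change reported in this message, where auto mode appends /extract and the tika core has no handler at that path. Both function names are hypothetical.

```python
# Content types copied from the "Not in: [...]" part of the logged error.
UPDATE_CONTENT_TYPES = frozenset({
    "application/xml", "text/csv", "text/json", "application/csv",
    "application/javabin", "text/xml", "application/json",
})


def update_handler_accepts(content_type):
    """Return True if the type (ignoring parameters such as charset) is in the logged list."""
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in UPDATE_CONTENT_TYPES


def post_url(update_url, auto):
    """Return the URL that would be posted to; auto mode appends /extract, as reported above."""
    return update_url.rstrip("/") + "/extract" if auto else update_url
```

Here update_handler_accepts("application/pdf") is False, matching the 400, and post_url("http://localhost:8983/solr/tika/update", True) ends in /update/extract, the path that returned 404.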






Re: Problems using DataImportHandler and TikaEntityProcessor

Posted by Furkan KAMACI <fu...@gmail.com>.
There may be a problem with your schema. Could you send your Solr logs?


2013/10/11 Peter Bleackley <bl...@zooey.co.uk>

> Starting Solr with the command line
>
>
> java -Dsolr.solr.home=example-DIH/solr -jar start.jar
>
>
> and then trying to import some data with
>
> java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf -jar post.jar *.pdf
>
> fails with error
>
> SimplePostTool: WARNING: Solr returned an error #400 Bad Request
> SimplePostTool: WARNING: IOException while reading response:
> java.io.IOException: Server returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/tika/update
>
> These are all valid PDFs that I have previously been able to import with
> Solr Cell.
>
> What am I doing wrong?
>
> Dr Peter J Bleackley
> Computational Linguistics Contractor
> Playful Technology Ltd
>
>
>