Posted to solr-user@lucene.apache.org by Peter Bleackley <bl...@zooey.co.uk> on 2013/10/11 16:57:51 UTC
Problems using DataImportHandler and TikaEntityProcessor
Starting Solr with the command line
java -Dsolr.solr.home=example-DIH/solr -jar start.jar
and then trying to import some data with
java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf
-jar post.jar *.pdf
fails with error
SimplePostTool: WARNING: Solr returned an error #400 Bad Request
SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 400 for URL:
http://localhost:8983/solr/tika/update
These are all valid PDFs that I have previously been able to import with
Solr Cell.
What am I doing wrong?
Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd
Re: Problems using DataImportHandler and TikaEntityProcessor
Posted by PeteBleackley <bl...@zooey.co.uk>.
OK, so I put my PDF files in a directory, /path/to/pdf, and edited
example-DIH/solr/tika/conf/tika-data-config.xml to contain this entity definition:
<entity name="tika-test"
processor="TikaEntityProcessor" url="/path/to/pdf"
format="xml" >
What should I do next?
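One possible next step, offered as a sketch rather than a confirmed answer from this thread: as far as I know, the url attribute of TikaEntityProcessor expects a single file rather than a directory, so a common pattern is to wrap it in a FileListEntityProcessor that walks the directory and hands each file to Tika. The Python below only generates such a data-config. The entity name "files", the data source name "bin", the file-name regex, and the target field "text" are assumptions, not taken from the thread, and should be checked against the core's schema.

```python
import xml.etree.ElementTree as ET


def build_tika_dih_config(pdf_dir: str) -> str:
    """Return a DIH data-config (as XML text) that runs Tika over each PDF in pdf_dir."""
    root = ET.Element("dataConfig")
    # BinFileDataSource supplies raw file bytes to TikaEntityProcessor.
    ET.SubElement(root, "dataSource", {"type": "BinFileDataSource", "name": "bin"})
    document = ET.SubElement(root, "document")

    # Outer entity: only lists files. rootEntity="false" means it does not
    # create Solr documents itself; the nested entity does.
    files = ET.SubElement(document, "entity", {
        "name": "files",                      # assumed name
        "processor": "FileListEntityProcessor",
        "dataSource": "null",
        "baseDir": pdf_dir,
        "fileName": r".*\.pdf",               # assumed pattern
        "recursive": "false",
        "rootEntity": "false",
    })

    # Inner entity: one Tika parse per file found by the outer entity.
    tika = ET.SubElement(files, "entity", {
        "name": "tika-test",
        "processor": "TikaEntityProcessor",
        "dataSource": "bin",
        "url": "${files.fileAbsolutePath}",
        "format": "xml",
    })
    # Assumes the schema has a field called "text" for the extracted body.
    ET.SubElement(tika, "field", {"column": "text", "name": "text"})

    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_dih_config("/path/to/pdf"))
```

The printed XML would replace the contents of tika-data-config.xml. The import itself is then started through the core's /dataimport handler, not with post.jar.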
Shawn Heisey-4 wrote
> On 10/11/2013 9:32 AM, PeteBleackley wrote:
>> I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
>> error, apparently caused by post.jar adding /extract to the end of the URL
>
> In order to use post.jar, you would need the /update/extract handler,
> which is not defined in the tika core under example-DIH.
>
> The example-DIH configurations are intended to use and illustrate the
> dataimport handler - documents are imported using the /dataimport
> handler and its config file, not sent directly with post.jar.
>
> Here's a page covering what you would need in order to send PDFs
> directly rather than import them using DIH:
>
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> Thanks,
> Shawn
--
View this message in context: http://lucene.472066.n3.nabble.com/Problems-using-DataImportHandler-and-TikaEntityProcessor-tp4094983p4095366.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problems using DataImportHandler and TikaEntityProcessor
Posted by Shawn Heisey <so...@elyograg.org>.
On 10/11/2013 9:32 AM, PeteBleackley wrote:
> I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
> error, apparently caused by post.jar adding /extract to the end of the URL
In order to use post.jar, you would need the /update/extract handler,
which is not defined in the tika core under example-DIH.
The example-DIH configurations are intended to use and illustrate the
dataimport handler - documents are imported using the /dataimport
handler and its config file, not sent directly with post.jar.
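To make the point above concrete, here is a small Python sketch of starting an import through the /dataimport handler instead of post.jar. The host, port, and core name come from the commands earlier in the thread; the clean and commit parameters and the helper names are assumptions chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base: str, core: str, command: str, **params: str) -> str:
    """Build a request URL for a core's /dataimport handler."""
    query = urlencode({"command": command, **params})
    return f"{base}/{core}/dataimport?{query}"


def run_full_import(base: str = "http://localhost:8983/solr", core: str = "tika") -> str:
    """Ask DIH to run a full import and return the raw response body.

    Needs a running Solr instance; DIH reads its sources from the core's
    data-config file, so no documents are sent in the request itself.
    """
    url = dataimport_url(base, core, "full-import", clean="true", commit="true")
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call run_full_import() once Solr is up.
    print(dataimport_url("http://localhost:8983/solr", "tika", "full-import"))
    print(dataimport_url("http://localhost:8983/solr", "tika", "status"))
```

Requesting the same handler with command=status afterwards should report how the import went.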
Here's a page covering what you would need in order to send PDFs
directly rather than import them using DIH:
http://wiki.apache.org/solr/ExtractingRequestHandler
Thanks,
Shawn
Re: Problems using DataImportHandler and TikaEntityProcessor
Posted by Furkan KAMACI <fu...@gmail.com>.
Here is a similar conversation:
http://search-lucene.com/m/GeXcg1YfgQ32/Re%253A+Solr+4.0+error+message%253A+%2522Unsupported+ContentType%253A+Content-type%253Atext%252Fxml%2522&subj=Re+Solr+4+0+error+message+Unsupported+ContentType+Content+type+text+xml+
Could you change -Dauto into -Dtype=application/pdf and try it again?
Re: Problems using DataImportHandler and TikaEntityProcessor
Posted by PeteBleackley <bl...@zooey.co.uk>.
kamaci wrote
> There may be a problem with your schema. Could you send your Solr logs?
>
> 2013/10/11 Peter Bleackley <bleackleyp@.co>
>
>> Starting Solr with the command line
>>
>> java -Dsolr.solr.home=example-DIH/solr -jar start.jar
>>
>> and then trying to import some data with
>>
>> java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf
>> -jar post.jar *.pdf
>>
>> fails with error
>>
>> SimplePostTool: WARNING: Solr returned an error #400 Bad Request
>> SimplePostTool: WARNING: IOException while reading response:
>> java.io.IOException: Server returned HTTP response code: 400 for URL:
>> http://localhost:8983/solr/tika/update
>>
>> These are all valid PDFs that I have previously been able to import with
>> Solr Cell.
>>
>> What am I doing wrong?
>>
>> Dr Peter J Bleackley
>> Computational Linguistics Contractor
>> Playful Technology Ltd
11228 [qtp1831924725-17] INFO org.apache.solr.update.processor.LogUpdateProcessor – [tika] webapp=/solr path=/update params={} {} 0 0
11229 [qtp1831924725-17] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: Unsupported ContentType: application/pdf Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:86)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:724)
I tried changing the options to -Dauto -Dfiletypes=pdf. This gave me a 404
error, apparently caused by post.jar adding /extract to the end of the URL
Re: Problems using DataImportHandler and TikaEntityProcessor
Posted by Furkan KAMACI <fu...@gmail.com>.
There may be a problem with your schema. Could you send your Solr logs?
2013/10/11 Peter Bleackley <bl...@zooey.co.uk>
> Starting Solr with the command line
>
> java -Dsolr.solr.home=example-DIH/solr -jar start.jar
>
> and then trying to import some data with
>
> java -Durl=http://localhost:8983/solr/tika/update -Dtype=application/pdf
> -jar post.jar *.pdf
>
> fails with error
>
> SimplePostTool: WARNING: Solr returned an error #400 Bad Request
> SimplePostTool: WARNING: IOException while reading response:
> java.io.IOException: Server returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/tika/update
>
> These are all valid PDFs that I have previously been able to import with
> Solr Cell.
>
> What am I doing wrong?
>
> Dr Peter J Bleackley
> Computational Linguistics Contractor
> Playful Technology Ltd