Posted to solr-user@lucene.apache.org by Mareike Glock <ma...@Student.HTW-Berlin.de> on 2019/05/19 09:02:16 UTC
Problem with SolrJ and indexing PDF files
Dear Solr Team,
I am trying to index Word and PDF documents with Solr using SolrJ, but
most of the examples I found on the internet use the SolrServer class,
which I believe is deprecated.
The connection to Solr itself is working, because I can add
SolrInputDocuments to the index, but it does not work for rich documents:
I get an exception.
public static void main(String[] args) throws IOException, SolrServerException {
    String urlString = "http://localhost:8983/solr/localDocs16";
    HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();

    // is working
    for (int i = 0; i < 1000; ++i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("cat", "book");
        doc.addField("id", "book-" + i);
        doc.addField("name", "The Legend of the Hobbit part " + i);
        solr.add(doc);
        if (i % 100 == 0) solr.commit(); // periodically flush
    }

    // is not working
    File file = new File("path\\testfile.pdf");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("update/extract");
    req.addFile(file, "application/pdf");
    req.setParam("literal.id", "doc1");
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    try {
        solr.request(req);
    } catch (IOException e) {
        PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
        e.printStackTrace(out);
        out.close();
        System.out.println("IO message: " + e.getMessage());
    } catch (SolrServerException e) {
        PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
        e.printStackTrace(out);
        out.close();
        System.out.println("SolrServer message: " + e.getMessage());
    } catch (Exception e) {
        PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
        e.printStackTrace(out);
        out.close();
        System.out.println("UnknownException message: " + e.getMessage());
    } finally {
        solr.commit();
    }
}
I am using Maven (pom.xml attached) and created a JAR file, which I then
tried to execute from the command line. This is the output I get:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
further details.
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for
further details.
UnknownException message: Error from server at
http://localhost:8983/solr/localDocs17: Bad contentType for search
handler :application/pdf request={wt=javabin&version=2}
I hope you may be able to help me with this. I also posted this issue on
Stack Overflow
<https://stackoverflow.com/questions/56149903/indexing-rich-documents-with-solrj-bad-contenttype-for-search-handler>.
Cheers,
Mareike Glock
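A side note on the error in the output above: "Bad contentType for search handler" typically means the request was routed to a search handler rather than the extracting handler, which can happen when the handler path is given without a leading slash. A minimal sketch of the likely fix (same names as in the code above; not verified against this exact setup):

```java
// Note the leading slash: "/update/extract" addresses the extracting
// handler explicitly. Without it, the request may end up at a search
// handler, which rejects the application/pdf content type.
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(file, "application/pdf");
req.setParam("literal.id", "doc1");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(req);
```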
Re: Problem with SolrJ and indexing PDF files
Posted by Erick Erickson <er...@gmail.com>.
Here’s a skeletal program to get you started using Tika directly in a SolrJ client, with a long explication of why using Solr’s extracting request handler is probably not what you want to do in production:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
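A bare-bones version of that approach (Tika in the client, plain SolrInputDocuments to Solr) might look like the following. This is a sketch only: it assumes solr-solrj and tika-parsers on the classpath, and the "title"/"content" field names are illustrative, not taken from any particular schema.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/localDocs16").build();

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();

        // Tika detects the file type and extracts body text plus metadata
        try (InputStream in = new FileInputStream(new File("testfile.pdf"))) {
            parser.parse(in, handler, metadata);
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("title", metadata.get("title")); // may be null
        doc.addField("content", handler.toString());  // extracted text
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
```

Doing the extraction client-side keeps Tika's memory and CPU load (and any parser crashes) out of the Solr servers, which is the main point of the article above.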
SolrServer was renamed SolrClient 4 1/2 years ago; one of my pet peeves is that lots of pages don't have dates attached. The link above was updated after this change even though it was published in 2012, but even so you'll find some methods that have since been deprecated.
If you’re using SolrCloud, you should be using CloudSolrClient rather than SolrClient.
Best,
Erick
> On May 19, 2019, at 5:07 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> You can use the Tika library to parse the PDFs and then post the text to the Solr servers
>
Re: Problem with SolrJ and indexing PDF files
Posted by Jörn Franke <jo...@gmail.com>.
You can use the Tika library to parse the PDFs and then post the text to the Solr servers