Posted to solr-user@lucene.apache.org by Mareike Glock <ma...@Student.HTW-Berlin.de> on 2019/05/19 09:02:16 UTC

Problem with SolrJ and indexing PDF files

Dear Solr Team,

I am trying to index Word and PDF documents with Solr using SolrJ, but most of the examples I found on the internet use the SolrServer class, which I guess is deprecated.
The connection to Solr itself is working, because I can add SolrInputDocuments to the index, but it does not work for rich documents: I get an exception.


public static void main(String[] args) throws IOException, SolrServerException {
         String urlString = "http://localhost:8983/solr/localDocs16";
         HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();

         //is working
         for(int i=0;i<1000;++i) {
             SolrInputDocument doc = new SolrInputDocument();
             doc.addField("cat", "book");
             doc.addField("id", "book-" + i);
             doc.addField("name", "The Legend of the Hobbit part " + i);
             solr.add(doc);
             if(i%100==0) solr.commit();  // periodically flush
         }

         //is not working
         File file = new File("path\\testfile.pdf");

         ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("update/extract");

         req.addFile(file, "application/pdf");
         req.setParam("literal.id", "doc1");
         req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
         try{
             solr.request(req);
         }
         catch(IOException e){
             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
             e.printStackTrace(out);
             out.close();
             System.out.println("IO message: " + e.getMessage());
         } catch(SolrServerException e){
             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
             e.printStackTrace(out);
             out.close();
             System.out.println("SolrServer message: " + e.getMessage());
         } catch(Exception e){
             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
             e.printStackTrace(out);
             out.close();
             System.out.println("UnknownException message: " + e.getMessage());
         }finally{
             solr.commit();
         }
}


I am using Maven (pom.xml attached) and created a JAR file, which I then 
tried to execute from the command line, and this is the output I get:

     SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
     SLF4J: Defaulting to no-operation (NOP) logger implementation
     SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
     SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
     SLF4J: Defaulting to no-operation MDCAdapter implementation.
     SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
     message: UnknownException message: Error from server at http://localhost:8983/solr/localDocs17: Bad contentType for search handler :application/pdf request={wt=javabin&version=2}



I hope you may be able to help me with this. I also posted this issue on Stack Overflow 
<https://stackoverflow.com/questions/56149903/indexing-rich-documents-with-solrj-bad-contenttype-for-search-handler>.

Cheers,
Mareike Glock


Re: Problem with SolrJ and indexing PDF files

Posted by Erick Erickson <er...@gmail.com>.
Here’s a skeletal program to get you started using Tika directly in a SolrJ client, with a long explication of why using Solr’s extracting request handler is probably not what you want to do in production: 

https://lucidworks.com/2012/02/14/indexing-with-solrj/
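
Not the program from that post, but a rough sketch of the same idea, assuming solr-solrj and the Tika parsers are on the classpath; the collection URL, file path, field names and metadata key are placeholders you'd adapt to your setup and schema:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/localDocs16").build();

        // Parse the file locally with Tika instead of sending it to /update/extract
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(new File("path\\testfile.pdf"))) {
            parser.parse(in, handler, metadata);
        }

        // Build an ordinary SolrInputDocument from the extracted text and metadata
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("title", metadata.get("title"));    // metadata key depends on the file type
        doc.addField("content", handler.toString());     // "content" must exist in your schema

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}

Doing the extraction on the client side keeps Tika's work (and any parser hangs or crashes on bad documents) out of the Solr JVM, which is the main argument in that post.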

SolrServer was renamed to SolrClient 4 1/2 years ago; one of my pet peeves is that lots of pages don’t have dates attached. The link above was updated after that change even though it was originally published in 2012, but even so you’ll find some methods that have since been deprecated.

If you’re using SolrCloud, you should be using CloudSolrClient rather than HttpSolrClient.
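
Roughly like this, assuming SolrJ 7 or later, a ZooKeeper ensemble at localhost:9983, and a collection named localDocs16 (all placeholders):

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // Point the client at ZooKeeper instead of a single Solr node URL
        CloudSolrClient cloud = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build();
        cloud.setDefaultCollection("localDocs16");
        // add/commit documents exactly as with HttpSolrClient
        cloud.close();
    }
}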

Best,
Erick

> On May 19, 2019, at 5:07 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> You can use the Tika library to parse the PDFs and then post the text to the Solr servers
> 


Re: Problem with SolrJ and indexing PDF files

Posted by Jörn Franke <jo...@gmail.com>.
You can use the Tika library to parse the PDFs and then post the text to the Solr servers
