Posted to solr-user@lucene.apache.org by Angel Ice <lb...@yahoo.fr> on 2009/09/02 13:56:44 UTC

Using SolrJ with Tika

Hi everybody.

I hope this is the right place for questions; if not, sorry.

I'm trying to index rich documents (PDF, MS docs etc.) in Solr/Lucene.
I have seen a few examples explaining how to use Tika for this, but most of them use curl to send documents to Solr, or an HTML POST with a file input.
I'd like to do it entirely in Java.
Is there a way to use SolrJ to index the documents through Solr's ExtractingRequestHandler, or at least to get the extracted XML back (with the extract.only option)?

Many thanks.

Laurent.

Re: Using SolrJ with Tika

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Angel,

I'm looking into it. Might need a new SolrRequest, but still playing around and will let you know...

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: Re : Using SolrJ with Tika

Posted by Grant Ingersoll <gs...@apache.org>.

See https://issues.apache.org/jira/browse/SOLR-1411

Re : Using SolrJ with Tika

Posted by Angel Ice <lb...@yahoo.fr>.

Hi

This is the solution I was testing.
I had some difficulties with AutoDetectParser, but I think it's the solution I will use in the end.


Thanks for the advice anyway :)

Regards,

Laurent




Re: Using SolrJ with Tika

Posted by Abdullah Shaikh <ab...@viithiisys.com>.

Hi Laurent,

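I am not sure if this is what you need, but you can extract the content from
the uploaded document (MS docs, PDF etc.) using Tika and then send it to Solr
for indexing.

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Extract the text with Tika; AutoDetectParser picks a parser from the detected type.
// inputStream is assumed to be an InputStream over the uploaded document.
BodyContentHandler handler = new BodyContentHandler();
new AutoDetectParser().parse(inputStream, handler, new Metadata());
String content = handler.toString();

and then,

SolrInputDocument doc = new SolrInputDocument();
doc.addField("DOC_CONTENT", content);

solrServer.add(doc);
solrServer.commit();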


Re: Re : Using SolrJ with Tika

Posted by rajan chandi <ch...@gmail.com>.

I have not used these APIs, but you don't actually need curl to POST the
document to Solr.
You can execute an HTTP POST using only Java.

http://www.jguru.com/faq/view.jsp?EID=62798
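
As an illustration of a Java-only POST, here is a minimal sketch. It assumes a JDK 11 or later (for java.net.http), that the extracting handler is mapped at /update/extract, and that the schema's unique key field is named id. The class name, method name, base URL and document id are made up for the example; none of them come from this thread. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class ExtractRequestSketch {

    // Builds (but does not send) a POST carrying the raw document bytes
    // to the extracting handler. The handler path is an assumption.
    static HttpRequest buildExtractRequest(String solrBaseUrl, String docId,
                                           byte[] body, String contentType) {
        // literal.id sets the "id" field of the indexed document;
        // commit=true asks Solr to commit after the add.
        String query = "literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
                + "&commit=true";
        return HttpRequest.newBuilder(URI.create(solrBaseUrl + "/update/extract?" + query))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real uploaded PDF.
        byte[] pdfBytes = "%PDF-1.4 placeholder".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildExtractRequest(
                "http://localhost:8983/solr", "doc 1", pdfBytes, "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient()
        //         .send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

In a webapp that already holds the uploaded bytes, the same request can be built from those bytes directly, so no external tool is involved.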

You might want to look at SolrInputDocument.

Whatever mechanism you use to post the document, the point is that
Solr 1.4 appears to handle this out of the box.

Regards
Rajan

Re : Using SolrJ with Tika

Posted by Angel Ice <lb...@yahoo.fr>.

Hi Rajan.

As mentioned in my message, I don't want to use curl to post documents, and I can't use an HTTP POST (the document has already been posted to my JEE webapp for other purposes). All I can use is Java.

In fact, I'd like the user to post the document to my webapp with an HTML POST (it's a Struts 2 webapp).  --This is OK.
Then my webapp uses the document for its own purposes. --This is OK.
And finally the webapp sends the document to Solr in order to index it.  --This is not OK.

That's what I already do for other content I index where there is no rich document, just some simple text fields, like daily articles.
In this case, once my webapp has finished its job on the article (creating, saving ...), I index the title/author/... like this:
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("art_title", "foo");
    ...
    solrServer.add(doc);
    solrServer.commit();

I'm looking for a way to do the same thing for rich documents, once my webapp has finished its job with the document.

Regards,

Laurent

Re: Using SolrJ with Tika

Posted by rajan chandi <ch...@gmail.com>.

Laurent,

Check out Solr 1.4.

You can download the trunk and build it on your box.

Solr 1.4 does this out of the box. No configuration required.

You can use HTTP POST to post the document using a command-line utility like
curl, and the PDF/Word/RTF/PPT/XLS etc. will be indexed. We tested this last
week.

Tika has already been included in Solr 1.4.
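
For readers who want the equivalent of a curl file upload without leaving Java, the sketch below builds a multipart/form-data body by hand, which is the wire format a form-style file upload uses. It assumes a JDK 11 or later. The class name, method name, boundary string, field name and file name are all made up for illustration; only the body is built, nothing is sent.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class MultipartSketch {

    // Encodes a single file as a multipart/form-data body with one part.
    static byte[] buildMultipartBody(String boundary, String fieldName, String fileName,
                                     String contentType, byte[] fileBytes) {
        // Part headers, followed by an empty line that separates them from the content.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + contentType + "\r\n\r\n";
        // Closing delimiter: the boundary followed by two extra dashes.
        String tail = "\r\n--" + boundary + "--\r\n";

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(fileBytes);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real document.
        byte[] file = "%PDF-1.4 placeholder".getBytes(StandardCharsets.UTF_8);
        byte[] body = buildMultipartBody("sketchBoundary", "myfile", "report.pdf",
                "application/pdf", file);
        System.out.println(new String(body, StandardCharsets.UTF_8));
        // The request carrying this body needs the header:
        // Content-Type: multipart/form-data; boundary=sketchBoundary
    }
}
```

The boundary must not occur inside the file content; a long random string is the usual choice.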

Cheers
Rajan
