Posted to solr-user@lucene.apache.org by Steve Johnson <st...@parisgroup.net> on 2010/06/27 00:50:01 UTC

How to index rich document with XML payload?

Greetings,

I am new to Solr, but have gotten as far as successfully indexing 
documents both by sending XML describing the document and by sending the 
document itself using "update/extract".  What I want to do now is, in 
effect, do both of these on each of my documents.  I want to be able to 
have Tika do its magic first, and then I want to add additional fields 
to my document entries using XML.

Is there any way to do this?  In general, is there any way to apply 
multiple update requests to a single document entry?

I do understand that I can put literal values on the "update/extract" 
URL to do what I'm asking.  This is what I'll have to do if I can't 
figure out another way, but it seems messy to me...I'd much rather send 
an XML payload.

TIA for any help.

Re: How to index rich document with XML payload?

Posted by go canal <go...@yahoo.com>.

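Simple code like this runs Tika in your own code, before anything is sent to Solr:


// Classes used here come from java.io, org.apache.tika.metadata,
// org.apache.tika.parser, org.apache.tika.sax and org.xml.sax.
File file = new File("test.pdf");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
InputStream input = new FileInputStream(file);
try {
    // Detect the file type and write the body text into handler.
    parser.parse(input, handler, metadata);
} finally {
    input.close();  // close the stream even if parsing fails
}

The extracted content is then available as handler.toString().

rgds,
canal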
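
Once the text has been extracted on the client, it can be combined with any additional fields into a single XML update message, which is what the original question asked for. The following is only a minimal sketch using the standard library: the class SolrAddXml and the field names "text" and "department" are hypothetical, the string extractedText stands in for handler.toString(), and the fields would have to exist in your schema. Extracted text may also contain control characters that are not legal in XML, which this sketch does not strip.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of SolrJ or Tika): builds a Solr XML <add> message by hand.
final class SolrAddXml {

    // Escape the five characters that are special in XML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Turn field name/value pairs into one <add><doc>...</doc></add> message.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(e.getKey())).append("\">")
               .append(escape(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString() from the Tika snippet above.
        String extractedText = "Solr & Tika tutorial";

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "test.pdf");
        fields.put("text", extractedText);   // assumed schema field name
        fields.put("department", "sales");   // assumed custom field
        System.out.println(toAddXml(fields));
    }
}
```

The resulting string can be posted to the regular XML update handler, so the Tika text and the extra fields arrive in one request and no second update to the same document is needed.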





Re: How to index rich document with XML payload?

Posted by go canal <go...@yahoo.com>.

Hi,
I have just started using Solr. I am using the SolrJ client, but I upload the file directly to Solr. I think we could run Tika in our own code first instead.

Here I send the file directly to Solr, which does the text extraction:

CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
solr.setRequestWriter(new BinaryRequestWriter());

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
// read a file
File file = new File("tutorial.pdf");
up.addFile(file);
// literal.id supplies the value of the document's id field
up.setParam("literal.id", "tutorial.pdf");
// commit right away, waiting for the flush and for the new searcher
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);

So what we need to do is add the Tika step on the client side.

I have a question about up.setParam: am I able to create my own fields?
rgds,
canal
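
On the up.setParam question: as far as I know, the extracting handler accepts one literal.<fieldname> parameter per extra field, so custom fields can be passed the same way as literal.id, provided the field is defined in the schema or matches a dynamic field. The sketch below only shows what the resulting request URL looks like, using the standard library. The class ExtractUrl and the field name author_s are hypothetical; in SolrJ the equivalent would be a further up.setParam("literal.author_s", ...) call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds an update/extract URL with literal.* parameters.
final class ExtractUrl {

    static String build(String baseUrl, Map<String, String> literals) {
        StringBuilder url = new StringBuilder(baseUrl).append("/update/extract");
        char sep = '?';
        try {
            for (Map.Entry<String, String> e : literals.entrySet()) {
                // Each extra field becomes literal.<fieldname>=<value>, URL-encoded.
                url.append(sep)
                   .append(URLEncoder.encode("literal." + e.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                sep = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(ex);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("id", "tutorial.pdf");
        literals.put("author_s", "Jane Doe");  // assumed custom (dynamic) field
        System.out.println(build("http://localhost:8983/solr", literals));
    }
}
```

This is the approach the original poster called messy; it works for a handful of short values, while longer or more structured metadata is easier to send as an XML payload.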



