Posted to solr-user@lucene.apache.org by Steve Johnson <st...@parisgroup.net> on 2010/06/27 00:50:01 UTC
How to index rich document with XML payload?
Greetings,
I am new to Solr, but have gotten as far as successfully indexing
documents both by sending XML describing the document and by sending the
document itself using "update/extract". What I want to do now is, in
effect, do both of these on each of my documents. I want to be able to
have Tika do its magic first, and then I want to add additional fields
to my document entries using XML.
Is there any way to do this? In general, is there any way to apply
multiple update requests to a single document entry?
I do understand that I can put literal values on the "update/extract"
URL to do what I'm asking. This is what I'll have to do if I can't
figure out another way, but it seems messy to me. I'd much rather send
an XML payload.
TIA for any help.
Re: How to index rich document with XML payload?
Posted by go canal <go...@yahoo.com>.
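Simple code like this, using Tika directly:
// Needs java.io.* plus Tika's Metadata, AutoDetectParser and BodyContentHandler,
// and org.xml.sax.ContentHandler.
File file = new File("test.pdf");
InputStream input = new FileInputStream(file);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
try {
    parser.parse(input, handler, metadata); // detects the file type and extracts the text
} finally {
    input.close(); // always release the file handle
}
The extracted content is then available from handler.toString().
rgds,
canal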
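Once the text has been extracted on the client, it could be sent to Solr together with any extra fields as an ordinary XML update message. The following is only a rough sketch of that idea using plain Java: the class name SolrAddXml and the field names "id", "text" and "author" are made up for illustration and would have to match fields in your own schema. It does not strip control characters that are illegal in XML, which extracted text may contain.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SolrAddXml {
    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an <add><doc>...</doc></add> update message from field name/value pairs.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "test.pdf");                  // hypothetical unique key
        fields.put("text", "output of handler.toString() goes here");
        fields.put("author", "Steve & Co");            // hypothetical extra field
        System.out.println(toAddXml(fields));
    }
}
```

The resulting string can be posted to the normal "update" handler, so the Tika output and the additional fields arrive in a single request.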
________________________________
From: go canal <go...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Sun, June 27, 2010 9:45:57 AM
Subject: Re: How to index rich document with XML payload?
Hi,
I just started using Solr. I am using the SolrJ client, but uploading the file directly to Solr. I think we can use Tika in our own code first instead.
Here I send the file directly to Solr which will do the text extraction:
CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
solr.setRequestWriter(new BinaryRequestWriter());
// send the raw file to the extracting handler
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
// read a file
File file = new File("tutorial.pdf");
up.addFile(file);
// literal.<field> sets a field value directly; here it is the document id
up.setParam("literal.id", "tutorial.pdf");
// commit right away (waitFlush = true, waitSearcher = true)
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);
So what we need to do is add Tika on our side.
I have a question about up.setParam: am I able to create my own fields?
rgds,
canal
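On the setParam question: the extract handler accepts one literal.<fieldname> parameter per field, and that field has to exist in the schema. A small sketch of what the resulting request URL looks like, written in plain Java so it runs without SolrJ; the class name ExtractUrl and the "author" field are invented for the example. With SolrJ the equivalent would be one up.setParam("literal.<fieldname>", value) call per field.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrl {
    // Append one literal.<field>=<value> parameter per entry to the update/extract URL.
    static String build(String solrBase, Map<String, String> literals)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(solrBase).append("/update/extract?commit=true");
        for (Map.Entry<String, String> e : literals.entrySet()) {
            sb.append("&literal.").append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=').append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("id", "tutorial.pdf");
        literals.put("author", "Steve Johnson"); // hypothetical field, must be in the schema
        System.out.println(build("http://localhost:8983/solr", literals));
    }
}
```

This is the "literal values on the URL" approach from the original question; it gets unwieldy once there are many fields or long values, which is where building the document on the client becomes more attractive.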