Posted to solr-user@lucene.apache.org by Fábio Aragão da Silva <ar...@gmail.com> on 2010/03/24 15:58:13 UTC

multiple binary documents into a single solr document - Vignette/OpenText integration

Hello there,
I'm writing code that integrates Solr with Vignette/OpenText Content
Management, so that Vignette content instances are indexed in Solr
when published and deleted from Solr when unpublished. I'm using
Solr 1.4, SolrJ and Solr Cell.

I've implemented most of the code and have run into only one issue
so far: Vignette Content Management supports attaching multiple
binary documents (such as .doc, .pdf or .xls files) to a single
content instance. I am mapping each content instance in Vignette to
a Solr document, but now I have a content instance with multiple
binary files attached to it.

So my question is: is it possible to have more than one binary file
indexed into a single document in solr?

I'm a beginner in Solr, but from what I understand I have two options
for indexing content with SolrJ: either use UpdateRequest and its
add() method to add a SolrInputDocument to the request (when the
document doesn't represent a binary file), or use
ContentStreamUpdateRequest and its addFile() method to add a binary
file to the content stream request.

I don't see a way, though, to say "this document comprises two
files, a Word document and a PDF, so index them as one document in
Solr using content1 and content2 fields (or merge their content into
a single 'content' field)".

I tried calling addFile() twice (one call per file); there was no
error, but nothing got indexed either.

ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("file1.doc"));
req.addFile(new File("file2.pdf"));
req.setParam("literal.id", "multiple_files_test");
req.setParam("uprefix", "attr_");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(req);

Any thoughts on this would be greatly appreciated.

greetings from Brazil,
Fábio.

Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Posted by briankous <bk...@behr.com>.
Hi there,

We are trying to replace OpenText (V7.6) Autonomy with Solr so that we can
index other content, too. Due to lack of manpower and time, management
wants to buy an adapter if one is available. Do you know of any vendor who
sells such an adapter or professional services? Thank you.

Brian Ko
bko@behr.com

Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Posted by Lance Norskog <go...@gmail.com>.
Do you want to index the text in the attachments?

If so, you probably are better off creating a unique document for the
mail body and each attachment. A field in the document could give the
id of the main email document. The main email document could contain a
multivalued field giving all of the attachment ids.
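A minimal sketch of that layout, assuming hypothetical field names (`parent_id`, `attachment_ids`, `file_name`) and using plain maps in place of SolrInputDocument so it runs without Solr on the classpath. It only shows how the ids could be wired together, not how the documents are sent:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain maps stand in for SolrInputDocument; all field names are assumptions.
final class AttachmentDocs {

    /** Builds one parent document plus one document per attachment. */
    static List<Map<String, Object>> build(String parentId, List<String> attachmentNames) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        List<String> attachmentIds = new ArrayList<String>();

        Map<String, Object> parent = new LinkedHashMap<String, Object>();
        parent.put("id", parentId);
        // Multivalued field listing every attachment document id.
        parent.put("attachment_ids", attachmentIds);
        docs.add(parent);

        for (int i = 0; i < attachmentNames.size(); i++) {
            String childId = parentId + "_att" + (i + 1);
            attachmentIds.add(childId);

            Map<String, Object> child = new LinkedHashMap<String, Object>();
            child.put("id", childId);
            child.put("parent_id", parentId); // points back to the main document
            child.put("file_name", attachmentNames.get(i));
            docs.add(child);
        }
        return docs;
    }
}
```

With Solr Cell, each attachment would presumably go in its own /update/extract request, with the child id passed as literal.id and the parent id as another literal.* parameter, provided the schema defines such a field.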

On Thu, Mar 25, 2010 at 10:14 AM, Chris Hostetter
<ho...@fucit.org> wrote:
> The key bit being: yes, you can attach multiple files to your request,
> and yes, the SolrQueryRequest abstraction can handle that (they appear as
> two "ContentStreams" to the RequestHandler), but the existing
> ExtractingRequestHandler assumes there will only be one ContentStream and
> constructs one document for it -- the API isn't really designed around
> the idea of generating a single SolrInputDocument from multiple
> ContentStreams (where would you get the "title" from? etc...)

-- 
Lance Norskog
goksron@gmail.com

Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Posted by Chris Hostetter <ho...@fucit.org>.
: > I tried calling addFile() twice (one call per file); there was no
: > error, but nothing got indexed either.
	...
: Write your own RequestHandler that uses the existing ExtractingRequestHandler
: to actually parse the streams, and then you combine the results arbitrarily in
: your handler, eventually sending an AddUpdateCommand to the update processor.
: You can obtain both the update processor and SolrCell instance from
: req.getCore().

The key bit being: yes, you can attach multiple files to your request,
and yes, the SolrQueryRequest abstraction can handle that (they appear as
two "ContentStreams" to the RequestHandler), but the existing
ExtractingRequestHandler assumes there will only be one ContentStream and
constructs one document for it -- the API isn't really designed around
the idea of generating a single SolrInputDocument from multiple
ContentStreams (where would you get the "title" from? etc...)

There was talk about trying to generalize this, but I don't think anyone
else has looked into it much.  Here's one reference, but I definitely
remember a more recent thread about this idea...

http://n3.nabble.com/ExtractingRequestHandler-and-XmlUpdateHandler-tt492202.html#a492211



-Hoss


Re: multiple binary documents into a single solr document - Vignette/OpenText integration

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-03-24 15:58, Fábio Aragão da Silva wrote:
> So my question is: is it possible to have more than one binary file
> indexed into a single document in solr?
>
> I tried calling addFile() twice (one call per file); there was no
> error, but nothing got indexed either.
>
> ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
> req.addFile(new File("file1.doc"));
> req.addFile(new File("file2.pdf"));
> req.setParam("literal.id", "multiple_files_test");
> req.setParam("uprefix", "attr_");
> req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> server.request(req);
>
> Any thoughts on this would be greatly appreciated.

Write your own RequestHandler that uses the existing 
ExtractingRequestHandler to actually parse the streams, and then you 
combine the results arbitrarily in your handler, eventually sending an 
AddUpdateCommand to the update processor. You can obtain both the update 
processor and SolrCell instance from req.getCore().
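The "combine the results" step could look roughly like the sketch below. It assumes the text of each attachment has already been extracted (how that happens inside a custom handler is not shown), and it uses a plain map in place of SolrInputDocument so it runs on its own. The class and method names are hypothetical; the `content1`/`content2` and `content` field names come from the original question.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a plain map stands in for SolrInputDocument.
final class MergedAttachments {

    /** One field per attachment: content1, content2, ... */
    static Map<String, Object> separateFields(String id, List<String> extractedTexts) {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("id", id);
        for (int i = 0; i < extractedTexts.size(); i++) {
            doc.put("content" + (i + 1), extractedTexts.get(i));
        }
        return doc;
    }

    /** All attachments concatenated into a single "content" field. */
    static Map<String, Object> singleField(String id, List<String> extractedTexts) {
        StringBuilder merged = new StringBuilder();
        for (String text : extractedTexts) {
            if (merged.length() > 0) {
                merged.append('\n'); // keep attachments separated by a newline
            }
            merged.append(text);
        }
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("id", id);
        doc.put("content", merged.toString());
        return doc;
    }
}
```

Numbered fields require the schema to know each field (or a matching dynamic field), while the single merged field keeps the schema simple but loses track of which attachment a hit came from.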


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com