Posted to solr-user@lucene.apache.org by Anupam Bhattacharya <an...@gmail.com> on 2012/03/26 15:25:20 UTC

Index a set of files as one document in SOLR

I have a group of documents in XML and PDF formats.

Each XML document contains bibliographic information and a
reference to its supporting PDF document.
How can I index these parent-child document types in the SOLR schema as one
document? The PDF should be full-text indexed for searching, and only the
details from the corresponding parent XML should be shown when the PDF
contains the searched keyword.

How should I design this kind of functionality in SOLR?

I'd appreciate any help on this.

Regards
Anupam

Re: Index a set of files as one document in SOLR

Posted by Erick Erickson <er...@gmail.com>.
Consider writing a SolrJ program that extracts the data from the
PDF file and combines it with the XML data. Here's an example
to get you started; it shows at least how to do the PDF extraction.
The other part of the code deals with a database connection, which
you can ignore.

You'll have to read in the XML, parse it, extract the relevant bits,
and add them to the SolrInputDocument (see the example):

http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
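
As a rough illustration of the "read in the XML, parse it, and combine it
with the PDF text" step, here is a minimal sketch that uses only the JDK.
Everything specific in it is an assumption made for illustration: the XML
layout (<record>, <id>, <title>, <author>, <pdf>), the field names (id,
title, author, pdf_path, fulltext), and the class and method names. The PDF
extraction is passed in as a function so the sketch runs without extra
libraries; in a real program that function would wrap a text extractor such
as Tika, and each map entry would be copied into a SolrInputDocument with
addField(name, value).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: merges one bibliographic XML record and the text of
// the PDF it references into a single set of fields (one Solr document).
class CombinedDocSketch {

    // Returns the trimmed text of the first <tag> element, or "" if absent.
    static String textOf(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // xml: the bibliographic record (assumed layout, see lead-in).
    // pdfTextExtractor: maps the referenced PDF path to its plain text;
    // a stand-in for a real extractor such as Tika.
    static Map<String, Object> buildDocument(String xml,
                                             Function<String, String> pdfTextExtractor) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML record", e);
        }
        Element root = dom.getDocumentElement();

        // Field names are assumptions; they must match your own schema.
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", textOf(root, "id"));
        fields.put("title", textOf(root, "title"));
        fields.put("author", textOf(root, "author"));

        String pdfPath = textOf(root, "pdf");
        fields.put("pdf_path", pdfPath);
        // The PDF text goes into the same document as the XML details.
        fields.put("fulltext", pdfTextExtractor.apply(pdfPath));
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<record><id>rec-1</id><title>Sample title</title>"
                + "<author>A. Author</author><pdf>rec-1.pdf</pdf></record>";
        // Stub extractor so the sketch runs without a PDF library.
        Map<String, Object> doc = buildDocument(xml, path -> "stub text for " + path);
        System.out.println(doc);
    }
}
```

Because the XML details and the PDF text end up in one document, a match
in the PDF text returns that document's bibliographic fields. One way to
show only the XML details is to make the full-text field indexed but not
stored in the schema, or to list just the bibliographic fields in the fl
parameter of the query.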

Best
Erick
