Posted to solr-user@lucene.apache.org by nitinkhosla79 <ni...@gmail.com> on 2012/03/04 21:23:29 UTC

Indexing and mapping multiple files to a unique solr id

My use case is to index two files, a metadata file and a binary PDF file,
under a single unique Solr id. The metadata file is XML, and the schema
fields map to elements in that file.

What I do: extract the text from the PDF files (using pdftotext), process
that text, and pull out specific information (for example, the first
page/line of the PDF has information about the medicine and the research
stage). The retrieved information (medicine, research stage) needs to be
indexed so that one can search, sort, and facet on it.

I can create an XML file with the retrieved information (let's call it the
metadata file). Now, assuming my schema would be
<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage"......                                     ../>

Is there a way to put this metadata file and the pdf file in Solr?

What I have tried:
a) Based on a suggestion in the archives, I zipped these files and gave the
    zip to ExtractingRequestHandler. I was able to put all the content in
    Solr and make it searchable, but it appears as the content of the zip
    file. (I had to apply some patches to the Solr code base to make this
    work.) This is not sufficient, because the content of the metadata file
    is not mapped to field names.
    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"

b) I tried to work with DataImportHandler (BinURLDataSource), but I don't
    think I understand how it works, so I could not get far.
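
One possible variation on (a), sketched here under assumptions rather than
as a tested recipe: the same literal.<fieldname> request parameter that the
curl command already uses for literal.id can carry other field values too,
so the metadata could travel as URL parameters while the PDF itself is
posted unzipped. The metadata layout below (a flat <metadata> root whose
child element names match the schema field names) and the sample values are
hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def metadata_to_literals(metadata_xml):
    """Turn each child element of the metadata root into a literal.<name> parameter."""
    root = ET.fromstring(metadata_xml)
    return {f"literal.{child.tag}": (child.text or "").strip() for child in root}


def build_extract_url(base_url, doc_id, metadata_xml):
    """Build an /update/extract URL that carries the id and metadata as literals."""
    params = {"literal.id": doc_id, "commit": "true"}
    params.update(metadata_to_literals(metadata_xml))
    return f"{base_url}/update/extract?{urlencode(params)}"


# Hypothetical metadata file content; element names are assumed to match
# the schema fields (medicine, researchStage).
METADATA = (
    "<metadata>"
    "<medicine>aspirin</medicine>"
    "<researchStage>phase 2</researchStage>"
    "</metadata>"
)

url = build_extract_url("http://localhost:8983/solr", "doc1", METADATA)
print(url)
```

The printed URL would then replace the one in the curl command above, with
-F "myfile=@file.pdf" instead of the zip. Whether the literal values end up
in the right fields still depends on the handler configuration in
solrconfig.xml, which is not shown here.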

If someone has tips, please share.
I want to avoid creating a single file by merging the PDF text and the
metadata file.


--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-and-mapping-multiple-files-to-a-unique-solr-id-tp3798872p3798872.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing and mapping multiple files to a unique solr id

Posted by Erick Erickson <er...@gmail.com>.
Sounds like a fine application for using SolrJ. Here's a blog post
on the topic: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

In your case, just replace the Tika bit with the pdftotext extraction you're
already using (or, you could just let Tika do it instead) and combine that
with whatever you want in a SolrInputDocument (see the code)...
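
For readers who want the shape of that idea without the SolrJ dependency,
here is a rough sketch that swaps in Solr's XML update message over plain
HTTP: one <doc> is built client-side from the metadata fields plus the
pdftotext output, all under a single id. This is not the code from the blog
post. The "content" field name, the flat metadata layout, and the file
names in the usage comment are assumptions.

```python
import re
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0. pdftotext emits a
# form feed between pages, so the extracted text has to be cleaned first.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def build_add_message(doc_id, metadata_xml, body_text, body_field="content"):
    """Build one <add><doc> message holding the id, metadata fields, and PDF text."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        field = ET.SubElement(doc, "field", name=name)
        field.text = _INVALID_XML_CHARS.sub(" ", value)

    add_field("id", doc_id)
    # Assumption: each child element name of the metadata root is a schema field.
    for child in ET.fromstring(metadata_xml):
        add_field(child.tag, (child.text or "").strip())
    # Assumption: the schema has a text field named by body_field.
    add_field(body_field, body_text)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(update_url, message):
    """POST the XML message to Solr's update handler and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Usage (needs pdftotext on the PATH and a running Solr; file names are
# hypothetical):
#   with open("doc1-metadata.xml", encoding="utf-8") as f:
#       message = build_add_message("doc1", f.read(), pdf_to_text("doc1.pdf"))
#   post_to_solr("http://localhost:8983/solr/update?commit=true", message)
```

The point of the design is that the two source files never need to be
merged on disk: they only meet in the document that is sent to Solr.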

Best
Erick
