Posted to solr-user@lucene.apache.org by Yavar Husain <ya...@gmail.com> on 2015/03/04 11:04:50 UTC

Pattern for extracting text from a rich document and an associated metadata file

What is the best pattern to index the following kind of data:

HarryPotter.PDF
HarryPotter.txt

Avengers.Docx
Avengers.txt

For each of the above files, the metadata lies in the text file that has
the same name as the rich document (as can be seen above).

(1) The brute-force method that I can think of is to extract the text from
the rich document, extract the metadata from the associated txt file,
combine them into an XML document, and send it to Solr for indexing.

(2) Another thing that I can think of is to use SolrJ and just
programmatically read the PDF and the txt file and send them to Solr. If
this is the case, is it possible to send the PDF directly to Solr without
having to extract the text first in my SolrJ program?
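
Option (1) could be sketched roughly as below. This is only an illustration
under several assumptions: the metadata file is taken to hold one
"key: value" pair per line (the format is not stated above), the field
names "id" and "content" are hypothetical, and the extracted body text is
passed in as a plain string because the extraction step itself (for example
with Tika) is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SolrXmlSketch {

    // Parses "key: value" lines. This line format is an assumption.
    static Map<String, String> parseMetadata(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep <= 0) {
                continue; // skip blank or malformed lines
            }
            fields.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
        }
        return fields;
    }

    // Escapes the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static void appendField(StringBuilder xml, String name, String value) {
        xml.append("<field name=\"").append(escape(name)).append("\">")
           .append(escape(value)).append("</field>");
    }

    // Builds a Solr XML <add> message holding one document.
    // "id" and "content" are hypothetical field names.
    static String toAddXml(String id, String content, Map<String, String> metadata) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        appendField(xml, "id", id);
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            appendField(xml, e.getKey(), e.getValue());
        }
        appendField(xml, "content", content);
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Sample values only; real input would come from HarryPotter.txt.
        Map<String, String> meta =
                parseMetadata(List.of("author: Example Author", "genre: fantasy"));
        System.out.println(toAddXml("HarryPotter.PDF", "extracted text goes here", meta));
    }
}
```

The resulting string can be posted to the core's XML update handler. The
metadata keys become Solr field names, so they would need to exist in the
schema (or match a dynamic field).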

Is there something better that I can do quickly? I know that if I only had
rich documents, I would have used the Tika-Solr integration/requestHandlers
to do the job.
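
For what it is worth, the Tika-Solr integration (the ExtractingRequestHandler,
also called Solr Cell) accepts extra field values as "literal.<field>"
request parameters, so metadata read from the txt file can accompany the
raw PDF in a single request. The sketch below only builds such a request
URL. The handler path /update/extract, the core URL, and the field names
are assumptions about the local setup, and uploading the file body is left
out. It uses the URLEncoder.encode(String, Charset) overload, which needs
Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrlSketch {

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Builds an extract-handler URL that carries the metadata as
    // literal.<field> parameters. The path "/update/extract" is assumed
    // to be where the handler is configured.
    static String buildExtractUrl(String solrCoreUrl, String id, Map<String, String> metadata) {
        StringBuilder url = new StringBuilder(solrCoreUrl)
                .append("/update/extract?literal.id=").append(enc(id));
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            url.append("&literal.").append(enc(e.getKey()))
               .append('=').append(enc(e.getValue()));
        }
        return url.append("&commit=true").toString();
    }

    public static void main(String[] args) {
        // Sample values only; the core name "docs" is hypothetical.
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("author", "Example Author");
        System.out.println(
                buildExtractUrl("http://localhost:8983/solr/docs", "HarryPotter.PDF", meta));
    }
}
```

The PDF itself would then be sent as the body of a POST to that URL, and
Solr would run Tika on the server side.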

Any help would be appreciated.

Thanks,
Yavar

Re: Pattern for extracting text from a rich document and an associated metadata file

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Yavar,

I would stick with the approach in Erik's post:
http://lucidworks.com/blog/indexing-with-solrj/

Ahmet


