Posted to solr-user@lucene.apache.org by Yavar Husain <ya...@gmail.com> on 2015/03/04 11:04:50 UTC
Pattern for extracting text from a rich document and an associated
metadata file
What is the best pattern to index the following kind of data:
HarryPotter.PDF
HarryPotter.txt
Avengers.Docx
Avengers.txt
For each of the above files, the metadata lies in the text file that has the
same name as the rich document (as can be seen above).
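The pairing rule above can be sketched in a few lines. This is only an illustration: the helper name pair_documents is hypothetical, and it assumes that every .txt file is a metadata file and that base names match case-insensitively.

```python
from pathlib import Path

# Assumption: files with this suffix hold metadata; all others are rich documents.
METADATA_SUFFIX = ".txt"


def pair_documents(filenames):
    """Map each rich document to the .txt file sharing its base name (or None)."""
    paths = [Path(name) for name in filenames]
    # Index the metadata files by lowercased base name.
    metadata = {
        p.stem.lower(): p.name
        for p in paths
        if p.suffix.lower() == METADATA_SUFFIX
    }
    pairs = {}
    for p in paths:
        if p.suffix.lower() != METADATA_SUFFIX:
            # None marks a rich document that has no metadata file.
            pairs[p.name] = metadata.get(p.stem.lower())
    return pairs


pairs = pair_documents(
    ["HarryPotter.PDF", "HarryPotter.txt", "Avengers.Docx", "Avengers.txt"]
)
```

A rich document without a matching text file maps to None, so missing metadata can be detected before indexing.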
(1) The brute-force method I can think of is to extract the text from the
rich document and the metadata from the associated txt file, combine them
into an XML document, and send that to Solr for indexing.
(2) Another option I can think of is to use SolrJ to programmatically
read the PDF and the txt file and send both to Solr. In that case, is it
possible to send the PDF directly to Solr without having to extract the
text first in my SolrJ program?
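As a rough sketch of option (1), the XML that Solr's update handler accepts is an <add> element wrapping one <doc> per document. The function name build_add_xml and the field names content and author are assumptions for illustration, not part of any particular schema, and the text extraction itself is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, body_text, metadata):
    """Build a Solr <add><doc> update message from extracted text and metadata."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "content" is a hypothetical field name for the extracted body text.
    fields = {"id": doc_id, "content": body_text}
    fields.update(metadata)
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    # ElementTree escapes special characters such as & and < in field values.
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml(
    "HarryPotter.PDF",
    "text extracted from the PDF",
    {"author": "example author"},  # hypothetical metadata read from the .txt file
)
```

The resulting string would then be posted to the core's /update endpoint with a Content-Type of text/xml.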
Is there something better that I can do quickly? I know that if I had only
rich documents, I could have used the Tika-Solr integration/request handlers
to do the job.
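For context, the extracting request handler (Solr Cell, commonly mapped to /update/extract) accepts literal.<fieldname> parameters, which is one way to attach externally supplied metadata to a rich document that Solr parses itself. The sketch below only builds such a request URL. The base URL, the core name mycore, and the author field are assumptions, and the handler must be enabled in solrconfig.xml for the request to work.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, metadata, commit=True):
    """Build an /update/extract URL that passes metadata as literal.* params."""
    params = {"literal.id": doc_id}
    for name, value in metadata.items():
        # Each literal.<field> value is stored as-is in that field.
        params["literal." + name] = value
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


url = extract_url(
    "http://localhost:8983/solr/mycore",  # hypothetical Solr core URL
    "HarryPotter.PDF",
    {"author": "example author"},  # hypothetical metadata from HarryPotter.txt
)
```

The PDF bytes would be sent as the body of a POST to this URL, so the client does not need to extract the text first.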
Any help would be appreciated.
Thanks,
Yavar
Re: Pattern for extracting text from a rich document and an
associated metadata file
Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Yavar,
I would stick with Erik's post:
http://lucidworks.com/blog/indexing-with-solrj/
Ahmet