Posted to solr-user@lucene.apache.org by vascaino90 <jo...@gmail.com> on 2016/11/18 18:14:00 UTC

Index and search on PDF text using Solr

Hello, I'm new to Solr and I have a big problem.
I have many text documents in PDF format (more than 10000) and I need to
create a site for these PDFs. On this site, I have to provide a search for
any terms in these PDFs.
I have no idea how to start.
Can anyone help me?

Thank you so much.



--
View this message in context: http://lucene.472066.n3.nabble.com/Index-and-search-on-PDF-text-using-Solr-tp4306486.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Index and search on PDF text using Solr

Posted by Erick Erickson <er...@gmail.com>.
To get started, see the section in the Solr Reference Guide,
"Uploading Data with Solr Cell using Apache Tika":

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

The basic idea is to use Apache Tika to parse the PDF file and then
stuff the data into Solr. There are a lot of tweaks you'll need to do,
particularly mapping the metadata fields to Solr fields, but the above
should get you started. Once you get that operating, you can refine
your approach.
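
As a rough illustration of that flow, here is a minimal Python sketch
that posts one PDF to Solr Cell's extracting handler (/update/extract),
where Tika parses it on the server. The base URL, the collection name
"pdfs", the file path and the helper names build_extract_url and
index_pdf are all assumptions made up for this example. literal.id,
commit, uprefix and fmap.* are request parameters described in the
Reference Guide section above; which ones you need depends on your
schema.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, collection, doc_id, commit=False, extra_params=None):
    """Build the URL for Solr Cell's /update/extract handler.

    literal.id supplies the document's unique key. extra_params can carry
    field-mapping parameters such as uprefix or fmap.<source>=<target>.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    if extra_params:
        params.update(extra_params)
    return f"{solr_base.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


def index_pdf(solr_base, collection, pdf_path, doc_id, commit=False, extra_params=None):
    """Send the raw PDF bytes to Solr and return the HTTP status code."""
    url = build_extract_url(solr_base, collection, doc_id, commit, extra_params)
    with open(pdf_path, "rb") as f:
        body = f.read()
    request = Request(
        url,
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Assumed local Solr instance, collection name and file; adjust to your setup.
    status = index_pdf(
        "http://localhost:8983/solr",
        "pdfs",
        "example.pdf",
        doc_id="example.pdf",
        commit=True,
        extra_params={"uprefix": "attr_"},
    )
    print(status)
```

Committing after every file is slow when there are thousands of
documents, so in practice you would probably leave commit off and
commit once at the end, or rely on autoCommit.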

I'm personally not a fan of doing all this on the Solr server in a
_production_ environment unless it's a one-time operation. Here's a
writeup of why I think that, along with a model Java program that'd
allow you to do this on a Java client. It uses some older Solr classes
(e.g. CloudSolrServer is now CloudSolrClient), but it should give you a
starting place if you want to do something similar. It has both a
database bit and a Tika bit, but the database bits can just be taken
out; there's nothing about parsing the files with Tika that requires
them.

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
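
The linked program is Java and uses SolrJ. Purely to show the shape of
the client-side approach, here is a Python sketch, not a port of that
program. It assumes the text is extracted on the client by a function
you supply (extract_text is a hypothetical callable, for example a
wrapper around Tika), and that the schema has an "id" field, a "text"
field and accepts "attr_"-prefixed fields for metadata. Those names,
the batch size and the helper names are illustrative assumptions.

```python
import json
from urllib.request import Request, urlopen


def make_doc(doc_id, text, metadata=None):
    """Build one Solr document; the field names are assumptions about the schema."""
    doc = {"id": doc_id, "text": text}
    for key, value in (metadata or {}).items():
        doc[f"attr_{key}"] = value
    return doc


def batches(items, size):
    """Yield lists of at most `size` items so documents are sent in groups."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_batch(solr_base, collection, docs):
    """POST a JSON array of documents to the /update handler."""
    url = f"{solr_base.rstrip('/')}/{collection}/update"
    request = Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


def index_all(solr_base, collection, pdf_paths, extract_text, batch_size=100):
    """Extract text on the client, then send the documents to Solr in batches.

    extract_text is a caller-supplied (hypothetical) function: path -> str.
    Returns the number of documents sent.
    """
    sent = 0
    docs = (make_doc(path, extract_text(path)) for path in pdf_paths)
    for batch in batches(docs, batch_size):
        post_batch(solr_base, collection, batch)
        sent += len(batch)
    return sent
```

The point of this layout is that the expensive parsing happens outside
the Solr server, and Solr only receives plain documents. Commits are
left out here on the assumption that autoCommit is configured or that a
commit is issued separately once indexing finishes.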

Best,
Erick
