Posted to solr-user@lucene.apache.org by RICARDO EITO BRUN <re...@bib.uc3m.es> on 2015/12/17 15:54:37 UTC

Problem with Solr indexing "non-searchable" pdf files

Hi,
I am using Solr as part of the DSpace 5.4 software application.
I have a problem when running the DSpace indexing command
(index-discovery). Most of the files are not being added to the index, and
an exception is raised.

It seems that Solr does not process PDF files that are the result of
scanning without OCR (non-searchable PDF files).

Is there any way to tell Solr that the document metadata should be
processed even if the PDF file itself cannot be indexed?

Any suggestions on how to make the PDF files "searchable" using some kind of
batch process/tool?

Thanks in advance,
Ricardo

-- 
RICARDO EITO BRUN
Universidad Carlos III de Madrid

Re: Problem with Solr indexing "non-searchable" pdf files

Posted by Erick Erickson <er...@gmail.com>.

Not sure how much help I can be; I have no clue what DSpace is
doing with Solr.

If you're willing to try indexing straight to Solr, you can always use
SolrJ to parse the files; it's actually not very hard. Here's an example:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Some database stuff is mixed in there, but that can be removed.
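
To make the "parse it yourself" idea concrete, here is a minimal sketch in
Python rather than SolrJ, using only the standard library. It is not taken
from the linked post and is not what DSpace does. It assumes you have some
extractor of your own (passed in as a callable), and the field names "title"
and "fulltext" are hypothetical and must match your schema. The point is
only that a failed or empty text extraction does not stop the metadata from
being indexed.

```python
import json
from typing import Callable, Dict, List


def build_solr_doc(doc_id: str,
                   metadata: Dict[str, str],
                   extract_text: Callable[[], str]) -> Dict[str, str]:
    """Build one Solr document, keeping the metadata even if the PDF has no text."""
    doc = {"id": doc_id, **metadata}
    try:
        # extract_text is whatever parser you use; it is an assumption here.
        text = extract_text()
    except Exception:
        # A scanned PDF may make the parser fail; index the metadata anyway.
        text = ""
    if text.strip():
        # "fulltext" is a hypothetical field name.
        doc["fulltext"] = text
    return doc


def to_update_payload(docs: List[Dict[str, str]]) -> bytes:
    """Serialize documents as a JSON array, the shape Solr's JSON update handler accepts."""
    return json.dumps(docs).encode("utf-8")


def scanned_pdf_without_ocr() -> str:
    # Stand-in for a parser that cannot handle an image-only PDF.
    raise ValueError("no text layer")


docs = [
    build_solr_doc("item-1", {"title": "Scanned report"}, scanned_pdf_without_ocr),
    build_solr_doc("item-2", {"title": "Born-digital report"}, lambda: "Hello Solr"),
]
payload = to_update_payload(docs)
```

The payload could then be POSTed to the core's /update handler with a
Content-Type of application/json. In SolrJ the same shape would be a
try/catch around the parsing step before filling a SolrInputDocument.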

Otherwise, perhaps the DSpace folks have more guidance on
what they expect to do with PDFs, and how.

Best,
Erick
