Posted to solr-user@lucene.apache.org by Peter Bleackley <bl...@zooey.co.uk> on 2013/10/10 11:57:47 UTC

Using Solr Cell to index the internal structure of a PDF

I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can 
get Solr to ingest the entire document as one long string, stored in the 
index as "content". However, I want to index structure within the documents.

I know that the ExtractingRequestHandler uses Apache Tika to convert the 
documents to XHTML. I've used the Tika GUI to look at the XHTML 
representation, and I can see that each page is represented as a <div> 
element, and that structure within pages is represented by <p> elements. 
How do I configure Solr to index documents at this level of granularity?

Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd

Re: Using Solr Cell to index the internal structure of a PDF

Posted by Furkan KAMACI <fu...@gmail.com>.

You can have a look here:
http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
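As far as I know, the ExtractingRequestHandler produces a single Solr document per uploaded file, so page-level granularity probably has to be handled on the client side. One possible approach: request Tika's XHTML with extractOnly=true (or run Tika directly), split it on the page <div> elements, and post one Solr document per page. The following is a minimal Python sketch of the splitting step only. It assumes the XHTML marks each page as <div class="page">, as described above for the Tika GUI output. The sample XHTML, the function name pages_to_docs, and the field names (id, parent_id, page, paragraph) are all hypothetical and would need matching entries in your schema.

```python
import xml.etree.ElementTree as ET

# Tika emits XHTML in this namespace, so element names must be qualified.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample standing in for the XHTML that Tika produces for a PDF.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First paragraph on page one.</p><p>Second paragraph.</p></div>
<div class="page"><p>Only paragraph on page two.</p></div>
</body>
</html>"""


def pages_to_docs(xhtml, doc_id):
    """Return one dict per page <div>, with its <p> texts as a list.

    The dict keys are illustrative field names, not fields that Solr
    defines by default.
    """
    root = ET.fromstring(xhtml)
    docs = []
    page_no = 0
    for div in root.iter(XHTML_NS + "div"):
        # Skip any <div> that is not marked as a page.
        if div.get("class") != "page":
            continue
        page_no += 1
        paragraphs = []
        for p in div.iter(XHTML_NS + "p"):
            # Join nested text nodes and collapse runs of whitespace.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)
        docs.append({
            "id": "%s_p%d" % (doc_id, page_no),
            "parent_id": doc_id,
            "page": page_no,
            "paragraph": paragraphs,
        })
    return docs


if __name__ == "__main__":
    for doc in pages_to_docs(SAMPLE_XHTML, "report"):
        print(doc)
```

Each resulting dict could then be sent to the normal /update handler as its own document, with "paragraph" mapped to a multi-valued field. If one Solr document per PDF is enough and you only want the paragraphs kept separately, the handler's capture and fmap parameters (for example capture=p together with an fmap.p mapping to a multi-valued field) may be sufficient without any client-side code.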

