Posted to dev@lucene.apache.org by Roland Ucker <ro...@googlemail.com> on 2012/06/12 08:32:45 UTC

Text Extraction Using iText

Hello,

I would like to write my own PDF text/metadata extraction
module using iText instead of Tika/PDFBox.

Where to start? Any hints?

Regards,
Roland

Re: Text Extraction Using iText

Posted by Jack Krupansky <ja...@basetechnology.com>.
Start by looking at the Tika code that integrates PDFBox, since that is exactly where you want to end up if you want to integrate your code with Tika and SolrCell.

http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/ 

If you are going to replace PDFBox in Tika for SolrCell, that is one thing, but if you want to feed the output of your extractor directly to Solr from your own client application, see the Solr XML format and the SolrJ interface. Ultimately, your extractor will produce two things: 1) extracted content or body text, and 2) metadata, all of which are simply “fields” in a “Solr input document.”

http://wiki.apache.org/solr/UpdateXmlMessages 
http://wiki.apache.org/solr/Solrj
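
To make the "fields in a Solr input document" idea concrete, here is a minimal sketch of the client-side route using only the Java standard library. It turns a map of extracted values into a Solr XML <add> message of the kind described on the UpdateXmlMessages page. The class SolrAddXml and its methods are hypothetical helpers, not part of Solr, SolrJ, or iText, and the field names (id, title, content) are assumptions that would have to match your own schema. The iText extraction step is left out; the map simply stands in for whatever your extractor produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Solr or SolrJ): builds a Solr XML
// <add> message from already-extracted field values.
final class SolrAddXml {

    // Escape the characters that are special in XML text and attributes.
    // Note: this sketch does not strip control characters that are
    // illegal in XML; real PDF text may need that as well.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Body text and metadata are treated alike: each map entry becomes
    // one <field> element of a single <doc>.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(e.getKey())).append("\">")
               .append(escape(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumed field names; they must exist in your Solr schema.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "report.pdf");
        fields.put("title", "Q&A notes");
        fields.put("content", "Body text extracted from the PDF");
        System.out.println(toAddXml(fields));
    }
}
```

The resulting string is what you would POST to Solr's update handler. With SolrJ you would not build the XML yourself; you would put the same name/value pairs into a Solr input document and let the client library send it.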

-- Jack Krupansky
