Posted to user@tika.apache.org by "raufer92@gmail.com" <ra...@gmail.com> on 2017/08/29 10:21:38 UTC

Parsing text from PDF while keeping positional information

Hello,

I’m currently trying to use Apache Tika to extract text from various PDF files.

I’ve been searching through the API but couldn’t determine whether what I want is possible.

The normal parsing operation outputs a list of lines:

	line 1
	line 2
	…
	line n

I was curious whether it is possible not only to extract the lines, but also to obtain positional information for each one,

e.g. the page the line was parsed from and its Cartesian position on that page (if the PDF is viewed as an image):

	line 1  (metadata 1)
	line 2  (metadata 2)
	…
	line n  (metadata n)

Is this possible with Apache Tika?

Thanks,
Raul

RE: Parsing text from PDF while keeping positional information

Posted by "Allison, Timothy B." <ta...@mitre.org>.
We don't currently do this, unfortunately.  Wildloop has a pull request that would add this: https://github.com/apache/tika/pull/152

If at all possible, I'd want this to use the same format as the hOCR we're getting from Tesseract, so that consumers don't need one way of processing our XHTML for OCR and a different one for PDFs.

What do you think?

Best,

            Tim
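
To make the hOCR idea above concrete, here is a minimal sketch of what hOCR-style output for positioned lines could look like. It is not Tika's output and not the format used in the linked pull request. The `PositionedLine` record, the function names, and the choice of pixel coordinates are all assumptions made for illustration. Only the `ocr_page` / `ocr_line` class names and the `bbox x0 y0 x1 y1` and `ppageno` title properties come from the hOCR convention.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class PositionedLine:
    """Hypothetical record: one extracted line plus where it was found."""
    text: str
    page: int  # page number (assumed 1-based here)
    x0: int    # bounding box, assumed to be in pixels of the rendered page
    y0: int
    x1: int
    y1: int


def to_hocr_line(line: PositionedLine) -> str:
    """Render one line as an hOCR-style ocr_line span carrying its bbox."""
    title = f"bbox {line.x0} {line.y0} {line.x1} {line.y1}"
    return f"<span class='ocr_line' title='{title}'>{escape(line.text)}</span>"


def to_hocr(lines) -> str:
    """Group lines by page and wrap each group in an ocr_page div."""
    pages = {}
    for line in lines:
        pages.setdefault(line.page, []).append(line)

    out = []
    for page_no in sorted(pages):
        out.append(f"<div class='ocr_page' title='ppageno {page_no}'>")
        out.extend("  " + to_hocr_line(l) for l in pages[page_no])
        out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        PositionedLine("line 1", 1, 72, 90, 520, 104),
        PositionedLine("line 2", 1, 72, 110, 480, 124),
        PositionedLine("line n", 2, 72, 90, 300, 104),
    ]
    print(to_hocr(sample))
```

A consumer that already parses Tesseract's hOCR for `bbox` values could then read PDF line positions the same way, which is the point of sharing one format.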


