Posted to users@pdfbox.apache.org by "psconceicao@outlook.com" <ps...@outlook.com> on 2016/12/28 00:09:38 UTC
FW: Identify not visible characters - Overlapped characters - Missing file
Hello,
The PDF was removed by the system, so I'm sending a link to the file:
https://we.tl/OjuxglYTNH
Many thanks,
Paulo Sergio
From: Paulo Conceicao [mailto:psconceicao@outlook.com]
Sent: Tuesday, 27 December 2016 23:53
To: users@pdfbox.apache.org
Subject: Identify not visible characters - Overlapped characters
Hello everyone,
I am using PDFBox 1.8.12 (because I'm developing in C#), and I can extract every character from a PDF together with its position.
My objective is to perform a layout analysis and try to reproduce the PDF layout in a text file.
However, I'm facing a big problem: identifying characters that are not visible.
In the attached file, the text "Alandroal (Nossa Senhora da Conceic..." occupies some of the space used by the word "Rural" (row 5), but part of it is not visible.
I would appreciate help finding a way to identify the text that is not visible, so that I can leave those characters out of the text file.
The approach described at http://stackoverflow.com/questions/19809813/how-to-check-if-a-text-is-transparent-with-pdfbox does not work for the attached file (it only works with images).
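To make the overlap concrete, here is a minimal sketch of how colliding characters could be flagged from the extracted positions alone. It is plain Java and does not call PDFBox; the `Glyph` class, the `OverlapFinder` name and the sample coordinates are all hypothetical, and the box values are assumed to come from whatever position data the extraction step provides.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapFinder {

    /** Hypothetical glyph box: a character plus its bounding box in page units. */
    public static final class Glyph {
        final char ch;
        final float x, y, width, height;

        public Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** True when the two boxes share a region with positive area. */
    public static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.x + b.width && b.x < a.x + a.width
            && a.y < b.y + b.height && b.y < a.y + a.height;
    }

    /** Returns index pairs (i, j), with i < j, of glyphs whose boxes overlap. */
    public static List<int[]> findOverlaps(List<Glyph> glyphs) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < glyphs.size(); i++) {
            for (int j = i + 1; j < glyphs.size(); j++) {
                if (overlaps(glyphs.get(i), glyphs.get(j))) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up coordinates, only to illustrate the collision test.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('R', 100f, 50f, 8f, 10f));
        glyphs.add(new Glyph('A', 104f, 50f, 8f, 10f)); // shares space with 'R'
        glyphs.add(new Glyph('l', 120f, 50f, 4f, 10f)); // stands alone

        for (int[] p : findOverlaps(glyphs)) {
            System.out.println(glyphs.get(p[0]).ch + " overlaps " + glyphs.get(p[1]).ch);
        }
    }
}
```

A check like this only finds candidate collisions. It cannot tell which of the two characters is the hidden one, and that is the part I still don't know how to determine.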
Many thanks in advance,
Paulo Sergio