Posted to users@pdfbox.apache.org by Pravin Agrawal <Pr...@persistent.co.in> on 2011/02/23 12:31:32 UTC

PDFBox 1.4.0 concatenates the word at the end of a line with the first word of the next line

Hi All,

I tried extracting text from a sample PDF by running the class org.apache.pdfbox.ExtractText from the command line with PDFBox 1.4.0.
The extracted text contains concatenated words such as "ofgovernance" and "ProgressiveAlliance", which do not appear in the actual PDF.
It seems that PDFBox joins the last word of a line with the first word of the next line in some cases.

Please find the sample PDF attached to this mail.

Could someone please let me know whether this is a known bug and how to work around it?

<snip>
[root@vm-ps3152 lib]# java -cp .:pdfbox-1.4.0.jar:commons-logging-1.0.4.jar:fontbox-1.4.0.jar org.apache.pdfbox.ExtractText -console /tmp/chapter-04.pdf | more
Feb 23, 2011 3:55:04 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
173
Women, Children and Development
4.1 One of the six basic principles ofgovernance laid down in the United ProgressiveAlliance governmentÂs National CommonMinimum Programme (NCMP) is Âto empowe
r
women politically, educationally, economically
and legally. In the light of this, it is necessary
to assess how women and children actuallyfared in the process of development during theTenth Plan and what correctives need to be
applied.
</snip>
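
To get a rough idea of how many such joins the output above contains, one option is to scan the extracted text for words in which a lowercase letter is immediately followed by an uppercase one (as in "ProgressiveAlliance" or "theTenth"). The sketch below is only a heuristic and does not use any PDFBox API. The class name JoinedWordFinder and the method findJoined are hypothetical names chosen for illustration. It will miss all-lowercase joins such as "ofgovernance" and will also flag legitimate names such as "McDonald".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: flags words that look like two
// words joined without a space, based only on a lowercase-then-uppercase pair.
public class JoinedWordFinder {

    // A run of letters that contains a lowercase letter directly followed by
    // an uppercase letter somewhere inside it.
    private static final Pattern LOWER_THEN_UPPER =
            Pattern.compile("\\p{L}*\\p{Ll}\\p{Lu}\\p{L}*");

    // Returns every suspicious word in the order it appears in the text.
    static List<String> findJoined(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = LOWER_THEN_UPPER.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fragment taken from the extracted text quoted above.
        String sample = "principles ofgovernance laid down in the United "
                + "ProgressiveAlliance National CommonMinimum Programme (NCMP) "
                + "during theTenth Plan";
        for (String word : findJoined(sample)) {
            System.out.println(word);
        }
        // Prints ProgressiveAlliance, CommonMinimum and theTenth, one per line;
        // "ofgovernance" is not detected because it has no case change.
    }
}
```

Catching the all-lowercase joins would need a dictionary lookup or a comparison against the positions of the text in the PDF, which this sketch does not attempt.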


Thanks in advance
-Pravin
