Posted to dev@pdfbox.apache.org by "Ahmed Eltayeb (JIRA)" <ji...@apache.org> on 2017/03/16 13:57:41 UTC

[jira] [Created] (PDFBOX-3719) pdfbox reads spaces as tabs

Ahmed Eltayeb created PDFBOX-3719:
-------------------------------------

             Summary: pdfbox reads spaces as tabs 
                 Key: PDFBOX-3719
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3719
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.8.13
            Reporter: Ahmed Eltayeb
         Attachments: DummyDoc.docx, DummyDoc.pdf

I converted the attached Word document "DummyDoc.docx" to PDF ("DummyDoc.pdf").

I then used PDFBox 1.8.13 to extract the text:
java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt

The generated output is:

Dummy	document	for	tag	extraction	
	
Section	1	
	
\\DummyTagOne_01  
This	is	text	body	one	
	
\\DummyTagOne_02  
This	is	text	body	two	
	
Section	2	
\\DummyTagTwo_01  
This	is	text	body	three	
	
\\DummyTagTwo_02  
This	is	text	body	four	
	
\\DummyTagTwo_03  
This	is	text	body	five	


As you can see, the words are separated by tab characters, for example "This	is	text	body	one	" instead of "This is text body one", and the same happens on the other lines.
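
To make the difference easier to confirm, here is a minimal sketch in plain Java that makes the whitespace in a line of extracted text visible and counts the tabs. It also shows a possible post-processing workaround that replaces tabs with spaces. It does not fix the extraction itself. The class name TabCheck and its methods are hypothetical helpers for illustration, not part of PDFBox, and the sample string is copied from the output above.

```java
// Hypothetical helper for inspecting extracted text; not part of PDFBox.
class TabCheck {

    // Replace each tab with "<TAB>" and each space with "<SP>" so they can be told apart.
    static String visualize(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\t') {
                sb.append("<TAB>");
            } else if (c == ' ') {
                sb.append("<SP>");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Count the tab characters (U+0009) in the string.
    static int countTabs(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\t') {
                n++;
            }
        }
        return n;
    }

    // Workaround only: turn every tab into a single space after extraction.
    // This would also change any tabs that are genuinely present in the document.
    static String tabsToSpaces(String s) {
        return s.replace('\t', ' ');
    }

    public static void main(String[] args) {
        // One line from the extracted text shown above.
        String line = "This\tis\ttext\tbody\tone\t";
        System.out.println(visualize(line));
        System.out.println("tabs: " + countTabs(line));
        System.out.println(visualize(tabsToSpaces(line)));
    }
}
```

The same check could be run over each line of txt.txt to see whether every word separator comes out as a tab or only some of them.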



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
