Posted to dev@pdfbox.apache.org by "Kaleb Akalework (JIRA)" <ji...@apache.org> on 2016/09/15 19:25:20 UTC

[jira] [Updated] (PDFBOX-3499) PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF

     [ https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaleb Akalework updated PDFBOX-3499:
------------------------------------
    Attachment: nihao2.pdf

This is the input PDF I used

> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-3499
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3499
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.2
>            Reporter: Kaleb Akalework
>         Attachments: nihao2.pdf
>
>
> I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and Chinese characters, but for some reason it does not parse them correctly. Every extracted character is replaced by the first character in the line. For example, if the document contains 早上好, the extracted text correctly has 3 characters, but all 3 are the same: 早早早. The last two characters are replaced by the first. The same string is parsed correctly from a Word document. I was trying to use this with Tika-13, which uses PDFBox 2.0.2. On the advice of Tim Allison (from Tika), I tried it with PDFBox 2.0.3, and I still see the same problem. The following is the code I used.
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
> public class PDFBoxTesting {
>     private static PDFParser parser;
>     private static PDFTextStripper pdfStripper;
>     private static PDDocument pdDoc;
>     private static COSDocument cosDoc;
>     private static String Text;
>     private static String filePath;
>     private static File file;
>
>     public static String ToText() throws IOException {
>         pdfStripper = null;
>         pdDoc = null;
>         cosDoc = null;
>         filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
>         file = new File(filePath);
>         parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
>         parser.parse();
>         cosDoc = parser.getDocument();
>         pdfStripper = new PDFTextStripper();
>         pdDoc = new PDDocument(cosDoc);
>         pdDoc.getNumberOfPages();
>         pdfStripper.setStartPage(1);
>         pdfStripper.setEndPage(10); // reading text from page 1 to 10
>         // if you want to get text from full pdf file use this code
>         // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
>         Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
>         return Text;
>     }
>
>     public static void main(String[] args) {
>         // TODO Auto-generated method stub
>         try {
>             ToText();
>         } catch (Exception e) {
>             int i = 1;
>         }
>     }
> }
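
The symptom in the report (早上好 coming back as 早早早) is easier to confirm from code points than from a debugger view, where font fallback can hide what was actually extracted. The following is a minimal sketch of such a check, using only the Java standard library. The class and method names (ExtractionCheck, toCodePoints, isFirstCharRepeated) are hypothetical and are not part of PDFBox or Tika. The strings in main are the ones quoted in the report, hard-coded here rather than read from a PDF.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of PDFBox: inspects a string that a
// text extractor returned and reports what it really contains.
public class ExtractionCheck {

    /** Formats each code point of the text as U+XXXX, separated by spaces. */
    public static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /**
     * Returns true when the text has at least two code points and every
     * code point equals the first one, which is the symptom in this report.
     */
    public static boolean isFirstCharRepeated(String text) {
        int[] cps = text.codePoints().toArray();
        if (cps.length < 2) {
            return false;
        }
        for (int cp : cps) {
            if (cp != cps[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The three characters the PDF is said to contain, written as escapes
        // so the source file compiles regardless of its encoding.
        String expected = "\u65E9\u4E0A\u597D";
        // What the report says is returned instead.
        String extracted = "\u65E9\u65E9\u65E9";

        System.out.println(toCodePoints(expected));
        System.out.println(toCodePoints(extracted));
        System.out.println(isFirstCharRepeated(extracted));
    }
}
```

In practice the value returned by ToText() above would be passed to these methods in place of the hard-coded strings, one extracted line at a time.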



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
