Posted to users@pdfbox.apache.org by 김보섭 <bo...@gmail.com> on 2018/11/28 10:01:34 UTC
Text extracting error
We are trying to extract text from PDF files.
When we extract Korean text from a PDF file, the order of the extracted text
is wrong, while English text is extracted in the correct order.
The Korean characters themselves are extracted correctly; only their
sequence is wrong.
The problem occurs when
1. the PDF file contains a table
2. the characters differ in size
When we extract text from a PDF that contains a table, the text in the lowest
row appears first and the text in the highest row appears last.
ex) | 가 | 나 |   (in the table)
    | 다 | 라 |
->  다라
    가나        (extracted)
When the PDF uses several text sizes and fonts, the text in the smallest and
simplest font is extracted first, and the text in the largest and least
simple font is extracted last.
Please check whether this is a bug in the extraction of Korean text.
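import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static void extractStringfromPDF() throws IOException {
    final FileChooser filechooser = new FileChooser();
    File file = filechooser.showOpenDialog(null);
    if (file == null) {
        return; // the user cancelled the dialog
    }
    File txtFile = new File(file.getPath() + ".txt");
    // try-with-resources closes the document and the writer even if extraction fails
    try (PDDocument document = PDDocument.load(file);
         FileWriter fw = new FileWriter(txtFile, true)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);
        fw.write(text);
        System.out.println(text);
    } catch (Exception e) {
        e.printStackTrace();
    }
}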
The code above is what we use in our program.
Re: Text extracting error
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi, please try the setSortByPosition() method of the stripper. See also the
PDFBox FAQ.
Tilman
Re: Text extracting error
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,
Please try using the setSortByPosition option:
https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition-boolean-
This returns the text in "visual" order rather than in the order in which the
text objects appear in the PDF. Depending on the input PDF, this might give
you a better result.
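
To illustrate the difference between the two orders, here is a minimal sketch in plain Java. It does not use PDFBox: the Fragment class, the coordinates, and the sort rule (top to bottom, then left to right) are assumptions made for illustration only, and the sorting PDFBox really performs is more involved. The four cells are the 가/나/다/라 table from the question, written as Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class VisualOrderSketch {

    /** Hypothetical stand-in for one positioned piece of text; not a PDFBox class. */
    public static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates fragments in the order given, i.e. the order stored in the file. */
    public static String inContentOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top to bottom, then left to right, before concatenating. */
    public static String inVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inContentOrder(sorted);
    }

    /** The 2x2 table from the question, stored bottom row first (assumed coordinates). */
    public static List<Fragment> sampleTable() {
        List<Fragment> cells = new ArrayList<>();
        cells.add(new Fragment("\uB2E4", 10, 30)); // bottom-left cell
        cells.add(new Fragment("\uB77C", 40, 30)); // bottom-right cell
        cells.add(new Fragment("\uAC00", 10, 10)); // top-left cell
        cells.add(new Fragment("\uB098", 40, 10)); // top-right cell
        return cells;
    }

    public static void main(String[] args) {
        List<Fragment> cells = sampleTable();
        System.out.println("content order: " + inContentOrder(cells));
        System.out.println("visual order:  " + inVisualOrder(cells));
    }
}
```

In the code from the question, the corresponding change would be to call pdfStripper.setSortByPosition(true) before pdfStripper.getText(document).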
Maruan