Posted to users@pdfbox.apache.org by 김보섭 <bo...@gmail.com> on 2018/11/28 10:01:34 UTC

Text extracting error

We are trying to extract text from PDF files.
When we extract Korean text from a PDF file, the order of the text is
broken, while English text is extracted correctly.
This does not mean that Korean is not extracted from the PDF; the characters
themselves come out fine, but their sequence is wrong.
The problem occurs when
1. the PDF file contains a chart (table)
2. the characters differ in size from one another

When we extract a PDF that contains a chart, the text in the lowest row
appears first and the text in the highest row appears last.

ex) | 가 | 나 | (in the chart)
    | 다 | 라 |
-> 다라
   가나 (extracted)

And when a PDF has multiple text sizes and fonts,
the text in the smallest and simplest font is extracted first, and
the text in the largest and less simple font is extracted last.

Please check whether this is a bug in the extraction of Korean text.

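import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static void extractStringfromPDF() throws IOException {
      // Let the user pick the PDF to read
      final FileChooser filechooser = new FileChooser();
      File file = filechooser.showOpenDialog(null);
      if (file == null) {
         return; // the dialog was cancelled
      }
      File txtFile = new File(file.getPath() + ".txt");
      // try-with-resources closes the document and the writer even on failure
      try (PDDocument document = PDDocument.load(file);
           FileWriter fw = new FileWriter(txtFile, true)) {
         PDFTextStripper pdfStripper = new PDFTextStripper();
         String text = pdfStripper.getText(document);
         fw.write(text); // append the extracted text to the .txt file
         System.out.println(text);
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
The code above is what we use in our program.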

Re: Text extracting error

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi, please try the setSortByPosition() method of the stripper. See also the 
PDFBox FAQ.
Tilman



Re: Text extracting error

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,



Please try using the setSortByPosition option of PDFTextStripper:

https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition-boolean-

With this option enabled, the text is returned in "visual" order rather than in the order in which the text objects appear in the PDF. Depending on the input
PDF, this might give you a better result.
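
To illustrate why sorting by position can help, here is a minimal, self-contained Java sketch. It does not use PDFBox and is not meant to reproduce PDFBox's actual algorithm; the names VisualOrderSketch and Fragment, the coordinates, and the assumption that y is measured from the top of the page are all hypothetical. The four cells are stored bottom row first, as in the example from the original post, and sorting them by position restores the reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class VisualOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge (assumption of this sketch)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** The 2x2 table from the post, stored bottom row first. */
    static List<Fragment> sampleTable() {
        return Arrays.asList(
                new Fragment("다", 0, 20), new Fragment("라", 10, 20),
                new Fragment("가", 0, 10), new Fragment("나", 10, 10));
    }

    /** Joins fragments in the given order, starting a new line whenever y changes. */
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null && f.y != previous.y) {
                sb.append('\n');
            }
            sb.append(f.text);
            previous = f;
        }
        return sb.toString();
    }

    /** Reads the fragments exactly as they were stored. */
    static String inStoredOrder(List<Fragment> fragments) {
        return join(fragments);
    }

    /** Sorts top to bottom, then left to right, before joining. */
    static String inVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> cells = sampleTable();
        System.out.println("Stored order:");
        System.out.println(inStoredOrder(cells));
        System.out.println("Visual order:");
        System.out.println(inVisualOrder(cells));
    }
}
```

In the code from the original post, the corresponding change would be to call pdfStripper.setSortByPosition(true) before pdfStripper.getText(document).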

Maruan

