You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2013/04/10 19:18:16 UTC
[jira] [Updated] (PDFBOX-951) Text extraction has issues on some pdfs

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-951:
--------------------------------------

    Fix Version/s: 1.8.0
    
> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.8.0
>
>         Attachments: file1.pdf, file2.pdf, file3.pdf, PDFBOX951-file3.txt
>
>
> Hi,
> i have noticed a big improvement in latest releases. Extraction is better but still has some problems.
> I have attached some files where i had problems.
> This is the code i use when extracting text:
> PDFTextStripper stripper = new PDFTextStripper(); 
> stripper.setStartPage( 1 );
> stripper.setEndPage( 1 );
> stripper.setSortByPosition(true);
> stripper.setWordSeparator("~");
> stripper.writeText(doc, sw);
> And here are some extracted lines from file1.pdf (I have skipped few lines and made them shorter because the problem is on the beggining of the line):
> 01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
> 02. Dan~^as~R.B.~1/2 finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
> 03. Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
> 04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
> 05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
> 06. Dan~^as~R.B.~igre bez uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
> 07. Uto~20:30~2001~BLACKPOOL~MANCHESTER UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
> 08. Uto~20:45~2002~WIGAN ASTON VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
> 09. Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
> 10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL....
> 11. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
> 12. Uto~20:45~2171~DONCASTER BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
> 13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
> 14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA ISHOD~POL.~....
> 15. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
> 16. Uto~20:45~2201~BRIGHTON COLCHESTER 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
> 17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
> 18. Uto~20:45~2203~LEYTON MK DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
> 19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....
> Lines 07, 09, 17 are extracted well and well formated.
> Lines 03 and 04 share the same problem, there are unnecessary spaces which should be line separators (in my case "~" separates words). I have seen this in other documents.
> Lines 08 and 18 for example doesn't have word separator ("~") between two team names. The space in the document between "Wigan" and "Aston Villa"  words is realy big.
> Lines 16 and 19 doesn't have word separator between second team name and first quota (COLCHESTER 1.70 and YEOVIL 1.67)
> FILE2.pdf has another problem. Words are shuffled. Here is top line from this file:
> G~KONA^ANERMANY~DUPLA~PRVO~GOL GOL45' / 90'~UKUPNO GOLOVA~KOMBI 1~ISHOD~ŠANSA~POLUVREME~NE GOL
> games are extracted excelent from this file but lines that should be headers are messed up:
> GERMANY~HENDIKEP~DRUGO~UKUPNO~UKUPNO~PRVI~ZADNJI
> DOMAC
>  GOLOVA~ GOLOVA~GOST
> UKUPNO~DUPLA  GOLOVA~VIŠE DAJE~WINNER 1~0:1~POLUVREME~PRVO~DRUGO~LOVA POLUVREME~ POLUVREME~GO~JE~GIONL~GOL~POBEDA DA
> And FILE3.pdf has some real issues. Nothing is extracted as it should be. Here are some lines:
> g~I~ o~ d~s~r~v~p~o~h
> .~S~a~.~K~ v~"~o~L~v~i~"~r~e~s~.~a~n~o~K~č~a~ n~i~d~s~o~h~l~p~u~a~D~ š~a~s~n~a~r~e~m~l~P~v~e~ u~o~-~ r~a~K~j~r~e~m~l~ v~e~u~o~p~a~o~  v~l~g~r~v~p~o~n~p~u~o~o~k~U~ o~n~p~u~k~U~a~r~e~m~l~ v~l~g~ g~r~u~d~v~e~u~o~p~o~o~o
>  g~A~ p~F~a~d~n~C~E~l~u~n
> r~e~m~l~a~v~e~n~u~o~p
> 9~0
> I hope this will help you people improve pdfbox.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira