Posted to dev@pdfbox.apache.org by "Dusan Radojevic (JIRA)" <ji...@apache.org> on 2011/01/27 15:04:46 UTC

[jira] Created: (PDFBOX-951) Text extraction has issues on some pdfs

Text extraction has issues on some pdfs
---------------------------------------

                 Key: PDFBOX-951
                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 1.4.0
            Reporter: Dusan Radojevic
            Priority: Minor
             Fix For: 1.5.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Description: 
Hi,

I have noticed a big improvement in the latest releases. Extraction is better but still has some problems.
I have attached some files where I had problems.

This is the code I use when extracting text:

// doc is the loaded PDDocument, sw is the Writer that receives the text
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(1);
stripper.setSortByPosition(true);
stripper.setWordSeparator("~");
stripper.writeText(doc, sw);

Here are some extracted lines from file1.pdf (I have skipped a few lines and shortened them, because the problem is at the beginning of each line):

01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
02. Dan~^as~R.B.~1/2 finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
03. Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
06. Dan~^as~R.B.~igre bez uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
07. Uto~20:30~2001~BLACKPOOL~MANCHESTER UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
08. Uto~20:45~2002~WIGAN ASTON VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
09. Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL....
11. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
12. Uto~20:45~2171~DONCASTER BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA ISHOD~POL.~....
15. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
16. Uto~20:45~2201~BRIGHTON COLCHESTER 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
18. Uto~20:45~2203~LEYTON MK DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....

Lines 07, 09 and 17 are extracted correctly and are well formatted.
Lines 03 and 04 share the same problem: they contain runs of spaces where there should be word separators (in my case "~" separates words). I have seen this in other documents.
Lines 08 and 18, for example, don't have a word separator ("~") between the two team names. The space in the document between "Wigan" and "Aston Villa" is really big.
Lines 16 and 19 don't have a word separator between the second team name and the first odds value ("COLCHESTER 1.70" and "YEOVIL 1.67").
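
As a possible client-side workaround for lines 03 and 04 (not a fix in PDFBox itself), the extracted text could be post-processed so that a run of two or more spaces is treated like the "~" separator. A minimal sketch in plain Java; the class name SeparatorCleanup and the two-space threshold are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SeparatorCleanup {
    // A field ends at the "~" word separator or at a run of two or more spaces.
    private static final Pattern FIELD_BREAK = Pattern.compile("~| {2,}");

    // Splits one extracted line into trimmed, non-empty fields.
    static List<String> splitFields(String line) {
        List<String> fields = new ArrayList<>();
        for (String part : FIELD_BREAK.split(line)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened version of line 04 from the report.
        String line04 = "Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15";
        System.out.println(splitFields(line04));
    }
}
```

This does not help with lines 08, 16, 18 and 19, where only a single space was emitted and cannot be told apart from the space inside a name such as "WEST HAM".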


[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-951:
--------------------------------------

    Attachment: PDFBOX951-file3.txt

Extracted text from file3.pdf using the current trunk (rev. 1063402)

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>         Attachments: file1.pdf, file2.pdf, file3.pdf, PDFBOX951-file3.txt

[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Issue Type: Bug  (was: Improvement)

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf, file2.pdf, file3.pdf

[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Description: 
Hi,

I have noticed a big improvement in the latest releases. Extraction is better but still has some problems.
I have attached some files where I had problems.

This is the code I use when extracting text:

// doc is the loaded PDDocument, sw is the Writer that receives the text
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(1);
stripper.setSortByPosition(true);
stripper.setWordSeparator("~");
stripper.writeText(doc, sw);

Here are some extracted lines from file1.pdf (I have skipped a few lines and shortened them, because the problem is at the beginning of each line):

01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
02. Dan~^as~R.B.~1/2 finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
03. Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
06. Dan~^as~R.B.~igre bez uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
07. Uto~20:30~2001~BLACKPOOL~MANCHESTER UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
08. Uto~20:45~2002~WIGAN ASTON VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
09. Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL....
11. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
12. Uto~20:45~2171~DONCASTER BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA ISHOD~POL.~....
15. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
16. Uto~20:45~2201~BRIGHTON COLCHESTER 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
18. Uto~20:45~2203~LEYTON MK DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....

Lines 07, 09 and 17 are extracted correctly and are well formatted.
Lines 03 and 04 share the same problem: they contain runs of spaces where there should be word separators (in my case "~" separates words). I have seen this in other documents.
Lines 08 and 18, for example, don't have a word separator ("~") between the two team names. The space in the document between "Wigan" and "Aston Villa" is really big.
Lines 16 and 19 don't have a word separator between the second team name and the first odds value ("COLCHESTER 1.70" and "YEOVIL 1.67").

file2.pdf has another problem: words are shuffled. Here is the top line from this file:

G~KONA^ANERMANY~DUPLA~PRVO~GOL GOL45' / 90'~UKUPNO GOLOVA~KOMBI 1~ISHOD~ŠANSA~POLUVREME~NE GOL

Games are extracted excellently from this file, but the lines that should be headers are garbled:

GERMANY~HENDIKEP~DRUGO~UKUPNO~UKUPNO~PRVI~ZADNJI
DOMAC
 GOLOVA~ GOLOVA~GOST
UKUPNO~DUPLA  GOLOVA~VIŠE DAJE~WINNER 1~0:1~POLUVREME~PRVO~DRUGO~LOVA POLUVREME~ POLUVREME~GO~JE~GIONL~GOL~POBEDA DA

file3.pdf has more serious issues: nothing is extracted as it should be. Here are some lines:

g~I~ o~ d~s~r~v~p~o~h
.~S~a~.~K~ v~"~o~L~v~i~"~r~e~s~.~a~n~o~K~č~a~ n~i~d~s~o~h~l~p~u~a~D~ š~a~s~n~a~r~e~m~l~P~v~e~ u~o~-~ r~a~K~j~r~e~m~l~ v~e~u~o~p~a~o~  v~l~g~r~v~p~o~n~p~u~o~o~k~U~ o~n~p~u~k~U~a~r~e~m~l~ v~l~g~ g~r~u~d~v~e~u~o~p~o~o~o
 g~A~ p~F~a~d~n~C~E~l~u~n
r~e~m~l~a~v~e~n~u~o~p
9~0

I hope this will help you improve PDFBox.
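
Output like the file3.pdf sample can at least be detected automatically before it is passed on. A minimal sketch in plain Java, assuming "~" as the word separator as above; the class name FragmentationCheck and the 0.8 threshold are arbitrary choices for illustration, not something PDFBox provides:

```java
import java.util.regex.Pattern;

final class FragmentationCheck {
    // Fraction of non-empty tokens that consist of exactly one character.
    static double singleCharRatio(String line, String separator) {
        int nonEmpty = 0;
        int single = 0;
        for (String token : line.split(Pattern.quote(separator))) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            nonEmpty++;
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                single++;
            }
        }
        return nonEmpty == 0 ? 0.0 : (double) single / nonEmpty;
    }

    // Flags a line when most of its tokens are single characters.
    // The 0.8 cut-off is a hypothetical value, not a tuned one.
    static boolean looksFragmented(String line, String separator) {
        return singleCharRatio(line, separator) >= 0.8;
    }

    public static void main(String[] args) {
        System.out.println(looksFragmented("g~I~ o~ d~s~r~v~p~o~h", "~"));
        System.out.println(looksFragmented("Sre~21:00~2003~LIVERPOOL~FULHAM~1.50", "~"));
    }
}
```

A check like this only reports the symptom; it says nothing about why the characters in file3.pdf are emitted one by one.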


> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf, file2.pdf, file3.pdf

[jira] Commented: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987776#action_12987776 ] 

Andreas Lehmkühler commented on PDFBOX-951:
-------------------------------------------

I had a look at file1.pdf. Those extra spaces are already present in the PDF itself.

The following is a small piece of the PDF content stream. The text is located within the parentheses:

[(2102)-1330(BIRMINGHAM     1)-591(2    WEST HAM)-5386(2.10)-848(3.10)-857(3.15) .....]TJ
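
To make the structure of that TJ array easier to see, here is a rough sketch in plain Java that splits it into its string operands and numeric adjustments. The class TjArrayDump and its regex are illustrative assumptions, not PDFBox code, and the regex ignores escaped or nested parentheses that real PDF strings may contain. Per the PDF specification, a number in a TJ array is expressed in thousandths of a text space unit and is subtracted from the current position, so a negative value such as -1330 widens the gap. The font size passed to gapWidth is a made-up value, since the excerpt does not show it, and horizontal scaling is assumed to be 100%.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TjArrayDump {
    // Group 1: a literal string "(...)"; group 2: a numeric adjustment.
    private static final Pattern ELEMENT =
            Pattern.compile("\\(([^)]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    // Returns the string operands in order, with inner spaces preserved.
    static List<String> strings(String tjArray) {
        List<String> result = new ArrayList<>();
        Matcher m = ELEMENT.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    // Returns the numeric adjustments found between the strings.
    static List<Double> adjustments(String tjArray) {
        List<Double> result = new ArrayList<>();
        Matcher m = ELEMENT.matcher(tjArray);
        while (m.find()) {
            if (m.group(2) != null) {
                result.add(Double.parseDouble(m.group(2)));
            }
        }
        return result;
    }

    // Horizontal shift caused by one adjustment, in text space units.
    static double gapWidth(double adjustment, double fontSize) {
        return -adjustment / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        String tj = "[(2102)-1330(BIRMINGHAM     1)-591(2    WEST HAM)"
                + "-5386(2.10)-848(3.10)-857(3.15) .....]TJ";
        for (String s : strings(tj)) {
            System.out.println("string: \"" + s + "\"");
        }
        double assumedFontSize = 6.0; // hypothetical, not taken from file1.pdf
        for (double a : adjustments(tj)) {
            System.out.println("adjustment " + a + " -> gap " + gapWidth(a, assumedFontSize));
        }
    }
}
```

The strings come out as "BIRMINGHAM     1" and "2    WEST HAM", which matches line 04 of the report: the "~" lands where the -591 adjustment sits, while the runs of spaces are ordinary characters inside the strings.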


> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>         Attachments: file1.pdf, file2.pdf, file3.pdf, PDFBOX951-file3.txt
>
>
> Hi,
> i have noticed a big improvement in latest releases. Extraction is better but still has some problems.
> I have attached some files where i had problems.
> This is the code i use when extracting text:
> PDFTextStripper stripper = new PDFTextStripper(); 
> stripper.setStartPage( 1 );
> stripper.setEndPage( 1 );
> stripper.setSortByPosition(true);
> stripper.setWordSeparator("~");
> stripper.writeText(doc, sw);
> And here are some extracted lines from file1.pdf (I have skipped few lines and made them shorter because the problem is on the beggining of the line):
> 01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
> 02. Dan~^as~R.B.~1/2 finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
> 03. Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
> 04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
> 05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...


[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Attachment: file1.pdf

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf
>
>


[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Description: 
Hi,

I have noticed a big improvement in the latest releases. Extraction is better but still has some problems.
I have attached some files where I had problems.

This is the code I use when extracting text:

PDFTextStripper stripper = new PDFTextStripper(); 
stripper.setStartPage( 1 );
stripper.setEndPage( 1 );
stripper.setSortByPosition(true);
stripper.setWordSeparator("~");
stripper.writeText(doc, sw);

And here are some extracted lines from file1.pdf (I have skipped a few lines and shortened them, because the problem is at the beginning of each line):

01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
02. Dan~^as~R.B.~1/2 finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
03. Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
06. Dan~^as~R.B.~igre bez uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
07. Uto~20:30~2001~BLACKPOOL~MANCHESTER UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
08. Uto~20:45~2002~WIGAN ASTON VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
09. Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL....
11. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
12. Uto~20:45~2171~DONCASTER BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA ISHOD~POL.~....
15. Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
16. Uto~20:45~2201~BRIGHTON COLCHESTER 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
18. Uto~20:45~2203~LEYTON MK DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....

Lines 07, 09 and 17 are extracted correctly and are well formatted.
Lines 03 and 04 share the same problem: they contain runs of spaces where there should be word separators (in my case "~" separates words). I have seen this in other documents.
Lines 08 and 18, for example, have no word separator ("~") between the two team names. The gap in the document between "Wigan" and "Aston Villa" is really big.
Lines 16 and 19 have no word separator between the second team name and the first odds value (COLCHESTER 1.70 and YEOVIL 1.67).
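
The symptom seen in lines 03 and 04 can be checked mechanically on the extracted text. The following is only a minimal sketch under stated assumptions: it uses plain Java with no PDFBox calls, and the names SeparatorCheck and suspectFields are hypothetical helpers invented for illustration. It splits an extracted line on the word separator and reports every field that still contains a run of two or more spaces. It would not catch lines such as 08 or 16, where the missing separator shows up as a single space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of PDFBox.
// It inspects text that has already been extracted with a custom word separator.
final class SeparatorCheck {
    // Two or more consecutive spaces inside a field suggest a wide gap
    // in the PDF for which no word separator was written.
    private static final Pattern WIDE_GAP = Pattern.compile(" {2,}");

    // Returns the fields of the line that contain a run of spaces.
    static List<String> suspectFields(String line, String separator) {
        List<String> suspects = new ArrayList<>();
        // Pattern.quote keeps the separator from being read as a regex.
        for (String field : line.split(Pattern.quote(separator))) {
            // Leading and trailing spaces are ignored; only inner gaps count.
            if (WIDE_GAP.matcher(field.trim()).find()) {
                suspects.add(field);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Shortened copy of line 03 from the report above.
        String line03 = "Uto~20:45~2101~ARSENAL      0~1      IPSWICH~1.15~6.25";
        for (String field : suspectFields(line03, "~")) {
            System.out.println("suspect field: [" + field + "]");
        }
    }
}
```

Running such a check over the whole output of file1.pdf would give a rough count of affected lines, which may help when comparing PDFBox versions.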

FILE2.pdf has another problem: words are shuffled. Here is the top line from this file:

G~KONA^ANERMANY~DUPLA~PRVO~GOL GOL45' / 90'~UKUPNO GOLOVA~KOMBI 1~ISHOD~ŠANSA~POLUVREME~NE GOL

Games are extracted excellently from this file, but the lines that should be headers are messed up:

GERMANY~HENDIKEP~DRUGO~UKUPNO~UKUPNO~PRVI~ZADNJI
DOMAC
 GOLOVA~ GOLOVA~GOST
UKUPNO~DUPLA  GOLOVA~VIŠE DAJE~WINNER 1~0:1~POLUVREME~PRVO~DRUGO~LOVA POLUVREME~ POLUVREME~GO~JE~GIONL~GOL~POBEDA DA

And FILE3.pdf has some real issues. Nothing is extracted as it should be. Here are some lines:

g~I~ o~ d~s~r~v~p~o~h
.~S~a~.~K~ v~"~o~L~v~i~"~r~e~s~.~a~n~o~K~č~a~ n~i~d~s~o~h~l~p~u~a~D~ š~a~s~n~a~r~e~m~l~P~v~e~ u~o~-~ r~a~K~j~r~e~m~l~ v~e~u~o~p~a~o~  v~l~g~r~v~p~o~n~p~u~o~o~k~U~ o~n~p~u~k~U~a~r~e~m~l~ v~l~g~ g~r~u~d~v~e~u~o~p~o~o~o
 g~A~ p~F~a~d~n~C~E~l~u~n
r~e~m~l~a~v~e~n~u~o~p
9~0

I hope this will help you improve PDFBox.

  was:
The same description as above, except that the code sample called stripper.setEndPage( i ) instead of stripper.setEndPage( 1 ).


> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf, file2.pdf, file3.pdf
>
>


[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-951:
--------------------------------------

    Fix Version/s:     (was: 1.5.0)

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>         Attachments: file1.pdf, file2.pdf, file3.pdf
>
>


[jira] Commented: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988103#action_12988103 ] 

Dusan Radojevic commented on PDFBOX-951:
----------------------------------------

File3 now has the same problem as File2: the text in the headers is shuffled.


> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>         Attachments: file1.pdf, file2.pdf, file3.pdf, PDFBOX951-file3.txt
>
>


[jira] Updated: (PDFBOX-951) Text extraction has issues on some pdfs

Posted by "Dusan Radojevic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Attachment: file3.pdf
                file2.pdf

> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: file1.pdf, file2.pdf, file3.pdf
>
>
