Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/10 23:42:33 UTC

[jira] [Closed] (PDFBOX-316) Extracting number show empty string

     [ https://issues.apache.org/jira/browse/PDFBOX-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-316.
------------------------------
    Resolution: Cannot Reproduce

> Extracting number show empty string
> -----------------------------------
>
>                 Key: PDFBOX-316
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-316
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1818588
> Originally submitted by astonishing1 on 2007-10-23 07:20.
> Hi,
> I want to extract a number that is 10 digits long and sits at a fixed position on each page of a PDF file.
> I used PrintTextLocations and PDFTextStripper to extract that ID number from the PDFs.
> The PDF is in Arabic, but I only want to extract the number.
> The problem is that when the PrintTextLocations utility prints the number, it always misses one or two digits and inserts an empty space in their place.
> Example:
> String[730.10004,116.75003 ft=Times-New-Roman+2 fs=200.0 xscale=0.05 height=5.000001 width=911.2002]    text: 16/10/2007
> String[775.7,32.75 ft=Times-New-Roman-Bold+1 fs=200.0 xscale=0.05 height=5.000001 width=933.4004]  text: RBKPI011
> String[786.15,116.75003 ft=Times-New-Roman-Bold+1 fs=200.0 xscale=0.05 height=5.000001 width=739.0] text:????? ?????
> String[375.85,89.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=1057.6797] text:?????? - 004
> String[330.9,101.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=3023.04]  text:??????  ??? ???? ????  -  1 4 58           (the number is 194758; the 9 and the 7 are missing)
> In the last line, the number after the dash that follows the Arabic words should be 194758, but the 9 and the 7 are missing.
> Similarly, since this large PDF file is generated daily, I parsed a newer one and got the following:
> String[329.75,101.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=3068.6406]?????? ????  ??  ????  -  1 06 14     (the number is 1906914; the 9s are missing)
> So which digits go missing is not fixed.
> Can anyone help? Thanks in advance.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1818588&file_id=251007
> 194758.pdf (application/pdf), 103183 bytes
> pdf file to extract data using PrintTextLocations utility
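
The symptom described above (digits replaced by blanks inside the number) can at least be detected after extraction. The following is a minimal sketch, assuming the text of each line has already been obtained from PrintTextLocations or PDFTextStripper as a plain string and that the number is the last thing on the line after a hyphen, as in the reporter's output. The class `ExtractedIdCheck` and its methods are hypothetical helpers written for illustration, not part of PDFBox, and the expected digit count is a caller-supplied assumption. The check only flags a suspicious value; it cannot recover the dropped digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): inspects the trailing number of an
// already-extracted text line and reports whether digits appear to be missing.
final class ExtractedIdCheck {

    // Matches a hyphen followed only by digits and whitespace up to the end of
    // the line; group 1 is the digit run, including any internal blanks.
    private static final Pattern TAIL = Pattern.compile("-\\s*([\\d\\s]*\\d)\\s*$");

    // Returns the text after the last hyphen (digits plus internal blanks),
    // or null when the line does not end in such a run.
    static String tailAfterDash(String line) {
        Matcher m = TAIL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Internal whitespace suggests that a glyph was dropped during extraction.
    static boolean hasGaps(String tail) {
        return tail.chars().anyMatch(Character::isWhitespace);
    }

    // The digits that did survive extraction, with blanks removed.
    static String digitsOnly(String tail) {
        return tail.replaceAll("\\s+", "");
    }

    // A value is treated as complete only if it has no gaps and the
    // caller-supplied number of digits.
    static boolean looksComplete(String line, int expectedDigits) {
        String tail = tailAfterDash(line);
        return tail != null && !hasGaps(tail) && tail.length() == expectedDigits;
    }

    public static void main(String[] args) {
        // Sample line copied from the report; '?' stands for the Arabic text.
        String line = "??????  ??? ???? ????  -  1 4 58";
        String tail = tailAfterDash(line);
        System.out.println("tail     = [" + tail + "]");
        System.out.println("has gaps = " + hasGaps(tail));
        System.out.println("digits   = " + digitsOnly(tail));
        System.out.println("complete = " + looksComplete(line, 6));
    }
}
```

Lines flagged this way could be logged together with their page number so that the affected pages can be inspected by hand.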



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)