Posted to users@pdfbox.apache.org by fb <be...@gmail.com> on 2008/12/05 15:14:33 UTC

converting a location in an image to pdf coordinates

Hello,
I'm losing my hair over coordinate conversion and image extraction.

Here is what I'm trying to do:

I want to run a keyword search on non-searchable PDFs, or on PDFs where
the text layer is not well positioned behind the images, and then
underline the results using annotations, using PDFBox and an OCR engine.

I've extended PrintImageLocation in the following way:
On a given page I extract all the images and generate PNG images with JAI
for better quality. (I tried using a single image for the whole page, but
the OCR results were not good enough, due to layout issues I think. With
JAI I expect to be able to posterize, reduce noise if necessary, etc. to
make the OCR happy.)
I run an OCR on them externally (ocropus/tesseract; it's C++, so I have
some "Process p = Runtime.getRuntime().exec(cmd);" code), which produces
hOCR files giving the text and the coordinates of each character.
By parsing the hOCR file I can then determine the coordinates of a
keyword.
At this point I have the coordinates of the keyword in the image, the
position of the image on the page and the size of the image.
I then try to translate the coordinates obtained from the parsed image
into coordinates on the PDF page.
First I invert the bounding box, as the OCR gives me an upper-left /
lower-right pair of points.
Then... I'm stuck: I expected the origin to be the lower-left corner of a
PDF page, but it seems to be the upper-left corner here.
And to be honest, I can hardly figure out which corner of the image is
used to determine its location, or what unit is used.
Inside the image, I retrieve coordinates in dots (pixels).
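
As an illustration of the hOCR parsing step: hOCR records each element's
box in its title attribute as "bbox x0 y0 x1 y1", in pixels, with the
origin at the upper-left corner of the image. A minimal sketch, assuming
the title attribute value has already been pulled out of the file; the
class name HocrBBox and the sample string are made up for illustration
(the numbers reuse the keyword box quoted further down):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the "bbox x0 y0 x1 y1" property of an hOCR title attribute. */
final class HocrBBox {
    // hOCR boxes are pixel coordinates with the origin at the upper-left corner.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(-?\\d+)\\s+(-?\\d+)\\s+(-?\\d+)\\s+(-?\\d+)");

    /** Returns {x0, y0, x1, y1}, or null if the title carries no bbox property. */
    static int[] parse(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 1));
        }
        return box;
    }

    public static void main(String[] args) {
        // Sample title value, invented for this sketch.
        int[] box = parse("bbox 2056 2484 2193 2501");
        System.out.println(Arrays.toString(box)); // prints [2056, 2484, 2193, 2501]
    }
}
```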

For example, here are the images I've found:
[I0] at 571.26746,71.80139 size=796.0658,93.23215 (small logo)
[I1] at 368.0984,85.12024 size=92.90973,196.4537 (small logo)
[I2] at 583.11694,707.5416 size=12841.42,15587.612 (the scanned article)
[I3] at 176.53192,341.2494 size=402.6675,1046.7035 (image attached to the article)

Visually, [I0] is in the upper left, [I1] is to the right of [I0], and
[I3] is in the upper right but below the line of [I0] and [I1].
[I2] is the "body" of the page, actually a press article, where I find
the occurrences of the keyword.

Here is a set of coordinates retrieved from the OCR processing (upper
left / lower right):
keyword: (2056.0/2484.0) (2193.0/2501.0)

which gives (lower left / upper right):
(2056.0/2501.0) (2193.0/2484.0)

Here are the coordinates of the same occurrence in the PDF (the result I
would expect after a conversion to lower left / upper right; obtained
here by parsing the text layer, which is hopefully well positioned):
START : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399
yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
END : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399
yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
--------->keyword   : 1790.4882,2322.2808, 1881.345,2348.561 (the
bounding box converted into a suitable unit system for placing
annotations on it)

I guess I have to set up a transformation matrix, but I don't know which
parameters I have to take into account (or whether they are available in
one way or another!).
Could someone offer some advice?
Thanks
fb.



AW: AW: converting a location in an image to pdf coordinates

Posted by An...@rwe.com.
>Additional questions :
># Does the PrintImageLocation class perform a transformation of the
>coordinates to a system where 0.0 is on the upperLeft corner  near this
>code (in processOperator() method ) :
>                    float ph = page.findMediaBox().getHeight();
>                    float pw = page.findMediaBox().getWidth();
>                    Matrix ctm =
>getGraphicsState().getCurrentTransformationMatrix();
>                    double rotationInRadians =(page.findRotation() *
>Math.PI)/180;  ...
>Or is it just in case of a rotation ?
PrintImageLocation computes the scaling factors by inverting the current rotation matrix, so
the rotation is of course taken into account. But the page rotation usually isn't relevant
for pages in portrait orientation.

>Are the scaling factors taken into account at this level too ? So,  I
>could bypass this and deal with my problem easier.
I guess the output contains the dimensions of the image on the page.

># As I want to draw annotation on what I've found, I guess I have to
>transform to a system where 0.0 is LowerRight corner of the PDF page. Am
>I right ? (and probably do some metric conversions also)
On pages without rotation, the 0,0 reference is the lower-left corner.
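
To make the rotation and scaling points concrete: in PDF, an image is
painted into a unit square that the current transformation matrix (CTM)
maps onto the page, and the top row of the image sits at y = 1 in that
square. A pixel position from the OCR (origin upper left) can therefore
be normalised to the unit square and pushed through the CTM. The sketch
below uses only java.awt.geom and takes the six CTM values as plain
numbers; how they are obtained from PDFBox is left out, and the class
name, method name and sample values are assumptions for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper, not part of PDFBox. */
final class PixelToUserSpace {
    /**
     * Maps an image pixel (origin upper left, y pointing down) to PDF user space
     * (origin lower left, y pointing up) for an image painted with CTM [a b c d e f].
     */
    static Point2D toUserSpace(double px, double py, int imgW, int imgH,
                               double a, double b, double c, double d,
                               double e, double f) {
        // Normalise to the unit square; flip y because pixel row 0 is the top row.
        double u = px / imgW;
        double v = 1.0 - py / imgH;
        // AffineTransform takes (m00, m10, m01, m11, m02, m12), which is (a, b, c, d, e, f).
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        return ctm.transform(new Point2D.Double(u, v), null);
    }

    public static void main(String[] args) {
        // Invented example: a 1000x2000 px scan drawn 300x600 units at (50, 100), no rotation.
        Point2D p = toUserSpace(500, 500, 1000, 2000, 300, 0, 0, 600, 50, 100);
        System.out.println(p.getX() + ", " + p.getY()); // prints 200.0, 550.0
    }
}
```

Because the whole matrix is applied, the same call also covers a rotated
placement, as long as the CTM passed in is the one in effect when the
image was drawn.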

BR,
Andreas





----------------------------------------------------------------
- Geschaeftsfuehrung: Chittur Ramakrishnan (Vorsitzender), 
Stefan Niehusmann - 
- Sitz der Gesellschaft: Dortmund - 
- Eingetragen beim Amtsgericht Dortmund - 
- Handelsregister-Nr. HR B 21222 - 
- USt.-IdNr. DE 2588 96 719 - 

Re: AW: converting a location in an image to pdf coordinates

Posted by fb <be...@gmail.com>.
Thanks for this quick answer.
Sorry if it's not clear (it's as clear as in my head right now !).
The coordinates I get from the OCR are in px, with 0,0 at the upper-left
corner of the image. I verified this using Gimp and it's OK.

Additional questions:
# Does the PrintImageLocation class transform the coordinates to a
system where 0,0 is the upper-left corner, near this code (in the
processOperator() method):
                    float ph = page.findMediaBox().getHeight();
                    float pw = page.findMediaBox().getWidth();
                    Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
                    double rotationInRadians = (page.findRotation() * Math.PI) / 180;  ...
Or is that only in the case of a rotation?
Are the scaling factors taken into account at this level too? If so, I
could bypass this and deal with my problem more easily.

# As I want to draw annotations on what I've found, I guess I have to
transform to a system where 0,0 is the LowerRight corner of the PDF page.
Am I right? (And probably do some unit conversions as well.)









AW: converting a location in an image to pdf coordinates

Posted by An...@rwe.com.

I don't understand every point of your problem, but here are some details you may be looking for:

- the PDF 0,0 reference is the lower-left corner (as you already mentioned)
- typical page dimensions are something like 612, 792 (Letter) or 595, 842 (DIN A4), both portrait
- images are drawn starting at their lower-left corner, with the width and height given in the PDF
- the image may be stored in the PDF document at larger or smaller dimensions than those used for displaying/printing

If you want to compare your OCR results with the PDF, you have to take the possible scaling of the image into account.

HTH,
Andreas
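
Putting the four points above together for the unrotated case: the
horizontal and vertical scale are the displayed size divided by the
stored pixel size, and the y axis has to be flipped using the pixel
height of the image (swapping the two y values of the OCR box is not
enough). A minimal sketch under those assumptions; the class name, the
parameter names and the numbers in main are invented for illustration
and are not the values from this thread:

```java
/** Hypothetical helper, not part of PDFBox. */
final class OcrBoxToPage {
    /**
     * Converts an OCR box given as upper-left (x0, y0) / lower-right (x1, y1) pixels
     * into a page rectangle {llx, lly, urx, ury} in PDF units.
     * imgX, imgY: lower-left corner of the image on the page.
     * dispW, dispH: size of the image as drawn on the page.
     * pxW, pxH: size of the stored image in pixels.
     */
    static double[] toPageRect(double x0, double y0, double x1, double y1,
                               int pxW, int pxH,
                               double imgX, double imgY,
                               double dispW, double dispH) {
        double sx = dispW / pxW; // page units per pixel, horizontally
        double sy = dispH / pxH; // page units per pixel, vertically
        double llx = imgX + x0 * sx;
        double urx = imgX + x1 * sx;
        // Pixel rows count down from the top; page y counts up from the bottom.
        double lly = imgY + (pxH - y1) * sy;
        double ury = imgY + (pxH - y0) * sy;
        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // Invented numbers: a 1000x2000 px image drawn 500x1000 units at (50, 100).
        double[] r = toPageRect(100, 200, 300, 400, 1000, 2000, 50, 100, 500, 1000);
        System.out.printf("%.1f %.1f %.1f %.1f%n", r[0], r[1], r[2], r[3]);
    }
}
```

The resulting lower-left / upper-right pair is in the same coordinate
system as the page, which is the form an annotation rectangle needs on
an unrotated page.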
