Posted to users@pdfbox.apache.org by STAMPF Lukas <lu...@bat.at> on 2019/09/20 08:07:03 UTC

Finding a Box containing text

Hello,

I am trying to find the (x, y, width, height) of a box containing a piece of text within a PDF document. Locating the text via TextPosition was pretty straightforward, but I have come to realize that I don't know the PDF operators well enough to locate the box.

Can somebody please have a look at the PDF I attached and tell me which "q" - "Q" block represents my "FIND ME" box? Can I subclass PDFRenderer to get the box position?

Regards,
Lukas

AW: AW: Finding a Box containing text

Posted by STAMPF Lukas <lu...@bat.at>.
Thanks a lot for all the answers. I managed to get it working for the relevant cases.

Regards,
Lukas


Re: AW: Finding a Box containing text

Posted by Peter Murray-Rust <pe...@googlemail.com.INVALID>.
I do a lot of this and there is no generic way. The rect might be a rect, or
4 lines, or a polyline of 3 or 4 segments (or 5 for overlaps). It might be drawn twice
for emphasis.
I have some heuristics for creating probable rects
in http://github.com/petermr/ami3
If you are serious and doing a *lot* I can show you where to find them.
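
One of the cases mentioned above, a box that is drawn twice, can be handled by collapsing near-identical rectangles before any further matching. The following is only a minimal sketch of that idea and is not taken from ami3; the class name RectangleDeduplicator and the tolerance value are assumptions made for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or ami3.
final class RectangleDeduplicator {

    // Two rectangles count as the same box if all four edges agree within tol.
    static boolean nearlyEqual(Rectangle2D a, Rectangle2D b, double tol) {
        return Math.abs(a.getMinX() - b.getMinX()) <= tol
                && Math.abs(a.getMinY() - b.getMinY()) <= tol
                && Math.abs(a.getMaxX() - b.getMaxX()) <= tol
                && Math.abs(a.getMaxY() - b.getMaxY()) <= tol;
    }

    // Keeps the first occurrence of each box and drops later near-duplicates.
    static List<Rectangle2D> dedupe(List<? extends Rectangle2D> rects, double tol) {
        List<Rectangle2D> unique = new ArrayList<>();
        for (Rectangle2D r : rects) {
            boolean seen = false;
            for (Rectangle2D u : unique) {
                if (nearlyEqual(r, u, tol)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                unique.add(r);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Rectangle2D> rects = Arrays.asList(
                new Rectangle2D.Double(50, 700, 200, 40),
                new Rectangle2D.Double(50.2, 700.1, 199.9, 40), // same box, drawn again
                new Rectangle2D.Double(300, 100, 100, 50));
        // Assumed tolerance of 0.5 user-space units.
        System.out.println(dedupe(rects, 0.5).size()); // two distinct boxes remain
    }
}
```

The pairwise comparison is quadratic in the number of rectangles, which is usually acceptable for a single page; a suitable tolerance depends on the producer of the PDF.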



Re: AW: Finding a Box containing text

Posted by PDF Developer <pd...@yahoo.com.INVALID>.
Lukas,
Quick answer:
I looked at the page content stream using the PDFBox Debugger, and appendRectangle isn't triggering because there isn't a rectangle in the page content stream. What is rendered is made up of a move and lines. I also had to handle this in my project. One way would be to override the other methods so that you catch moveTo, lineTo, closePath, fillAndStrokePath, etc., store the points, and check when closePath is called whether they form a rectangle.

I am about to go on a business trip, but if I have time I will see if I can cut down my code to illustrate this.

PDFDev
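
The point-collecting idea described above might look roughly like the sketch below. It is not the project code mentioned in this message; it deliberately avoids any PDFBox dependency and only shows the geometry check. The class name PathRectangleDetector and the tolerance EPS are assumptions, and the three methods are meant to be called from the corresponding moveTo, lineTo and closePath overrides of a PDFGraphicsStreamEngine subclass. Only axis-aligned rectangles are recognized.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): records one subpath and tests it on closePath.
final class PathRectangleDetector {
    // Assumed tolerance, in user-space units, for "same x" / "same y".
    private static final double EPS = 0.5;

    private final List<Point2D> points = new ArrayList<>();

    // Call from the engine's moveTo override: starts a new subpath.
    void moveTo(double x, double y) {
        points.clear();
        points.add(new Point2D.Double(x, y));
    }

    // Call from the engine's lineTo override.
    void lineTo(double x, double y) {
        points.add(new Point2D.Double(x, y));
    }

    // Call from the engine's closePath override.
    // Returns the bounds if the subpath is an axis-aligned rectangle, else null.
    Rectangle2D closePath() {
        List<Point2D> p = new ArrayList<>(points);
        points.clear();
        // Some producers draw the fourth side explicitly before closing.
        if (p.size() == 5 && p.get(0).distance(p.get(4)) < EPS) {
            p.remove(4);
        }
        if (p.size() != 4) {
            return null;
        }
        boolean[] horizontal = new boolean[4];
        for (int i = 0; i < 4; i++) {
            Point2D a = p.get(i);
            Point2D b = p.get((i + 1) % 4); // the last edge is the implicit closing edge
            boolean sameY = Math.abs(a.getY() - b.getY()) < EPS;
            boolean sameX = Math.abs(a.getX() - b.getX()) < EPS;
            if (sameX == sameY) {
                return null; // diagonal or zero-length edge
            }
            horizontal[i] = sameY;
        }
        // Edges must alternate horizontal/vertical around the loop.
        for (int i = 0; i < 4; i++) {
            if (horizontal[i] == horizontal[(i + 1) % 4]) {
                return null;
            }
        }
        Rectangle2D bounds = new Rectangle2D.Double(p.get(0).getX(), p.get(0).getY(), 0, 0);
        for (Point2D q : p) {
            bounds.add(q);
        }
        return bounds;
    }

    public static void main(String[] args) {
        PathRectangleDetector d = new PathRectangleDetector();
        d.moveTo(10, 10);
        d.lineTo(110, 10);
        d.lineTo(110, 40);
        d.lineTo(10, 40);
        System.out.println(d.closePath()); // bounds: x=10, y=10, w=100, h=30
    }
}
```

Paths that are stroked or filled without an explicit closePath, rotated boxes, and boxes built from four separate subpaths are not covered by this sketch and would need additional handling.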


  

AW: Finding a Box containing text

Posted by STAMPF Lukas <lu...@bat.at>.
Hello,

Thanks for the input. 
https://filebox.batmen.at/index.php/s/R2PA4HB6eIXkc8c

It seems I can't use the appendRectangle method; it is never triggered.

Regards,
Lukas



Re: Finding a Box containing text

Posted by PDF Developer <pd...@yahoo.com.INVALID>.
Hello Lukas,
This mailing list doesn't accept attachments; you probably want to use a hosting site instead.

I am currently working on a project that needs to identify text on a page within a rectangle.

This may or may not be appropriate, but to do this I extend PDFGraphicsStreamEngine, which has a method appendRectangle. If your PDF creation application is well behaved, you can just use that. That said, in the real world a rectangle can be made up of lines and moves, so you may have a bit more work to do. If you have the coordinates of the start of the string, you could enumerate the rectangles to see whether that point lies in one of them. Or you could do things slightly in reverse: take the bounds of each rectangle, use PDFTextStripperByArea to get the text in the rectangle, and check whether it is the string you are looking for.
Unfortunately I can't share my project code, but if you can find somewhere to host the PDF, I will see if I can use it as a test for my code and, if that is successful, provide a slimmed-down example.
PDFDev
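
The first approach above, enumerating the collected rectangles to see which one contains the start of the string, could be sketched as follows. This is an illustration only, not the project code referred to in this message. The class name EnclosingBoxFinder is hypothetical, and the sketch assumes that the rectangles and the text start point have already been converted to the same coordinate space. When several rectangles contain the point (for example a page frame and the box itself), it picks the smallest one, which is a design choice of this sketch.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class EnclosingBoxFinder {

    // Returns the smallest rectangle that contains the point, or null if none does.
    static Rectangle2D findEnclosing(List<? extends Rectangle2D> rects, Point2D textStart) {
        Rectangle2D best = null;
        for (Rectangle2D r : rects) {
            if (!r.contains(textStart)) {
                continue;
            }
            if (best == null || area(r) < area(best)) {
                best = r;
            }
        }
        return best;
    }

    private static double area(Rectangle2D r) {
        return r.getWidth() * r.getHeight();
    }

    public static void main(String[] args) {
        // Made-up example data: a page-sized frame, the box of interest, an unrelated box.
        List<Rectangle2D> rects = Arrays.asList(
                new Rectangle2D.Double(0, 0, 595, 842),
                new Rectangle2D.Double(50, 700, 200, 40),
                new Rectangle2D.Double(300, 100, 100, 50));
        Point2D textStart = new Point2D.Double(60, 715);
        Rectangle2D box = findEnclosing(rects, textStart);
        System.out.println(box.getX() + ", " + box.getY() + ", "
                + box.getWidth() + ", " + box.getHeight());
    }
}
```

Testing only the start point can miss a box when the text begins slightly outside the border; checking the full text bounds instead of a single point would be a possible refinement.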


Re: Finding a Box containing text

Posted by Tilman Hausherr <TH...@t-online.de>.
On 20.09.2019 at 10:07, STAMPF Lukas wrote:
> but I had to realize that I don’t know PDF Operators well enough to 
> locate the box. 

Please have a look at this answer:
https://stackoverflow.com/questions/38931422/
It shows how to catch the lines of a PDF.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org