
Posted to dev@pdfbox.apache.org by Pranay Pramod <pr...@gmail.com> on 2009/09/30 17:43:28 UTC

deducing table cells in a PDF document

I have a requirement to extract, along with the normal text, the text inside
table cells together with its bounding-box information.
I need to map the text to its table cell coordinates, something like
this:

<table>
<cell bottomLeft="200,200" topRight="400,400"> some text </cell>
<cell bottomLeft="200,100" topRight="400,200"> some more text </cell>
</table>

By implementing the graphics operators moveTo (m), lineTo (l) and rectangle
(re), plus my own cell-deduction algorithm that works on the resulting set of
points and lines, I have been able to get most of what I wanted.
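
The cell-deduction algorithm itself is not shown in this thread. As a rough illustration only, here is a minimal sketch that assumes the table is a complete grid whose ruling-line positions have already been collected from the m, l and re operators. The class CellGridSketch and the method cellsFromRulings are hypothetical names, not part of PDFBox and not the author's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: turns ruling-line positions into cells.
// Assumption: the table is a complete grid (no merged or missing cells).
class CellGridSketch {

    /** Returns one entry per cell as {x1, y1, x2, y2}: bottom-left and top-right corners. */
    static List<double[]> cellsFromRulings(double[] verticalXs, double[] horizontalYs) {
        // Sort and de-duplicate the ruling positions.
        TreeSet<Double> xs = new TreeSet<>();
        for (double x : verticalXs) {
            xs.add(x);
        }
        TreeSet<Double> ys = new TreeSet<>();
        for (double y : horizontalYs) {
            ys.add(y);
        }
        Double[] xArr = xs.toArray(new Double[0]);
        Double[] yArr = ys.toArray(new Double[0]);

        List<double[]> cells = new ArrayList<>();
        // Walk rows from top to bottom and columns from left to right;
        // each pair of neighbouring rulings bounds one cell.
        for (int row = yArr.length - 1; row > 0; row--) {
            for (int col = 0; col < xArr.length - 1; col++) {
                cells.add(new double[] {xArr[col], yArr[row - 1], xArr[col + 1], yArr[row]});
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Rulings chosen to match the two example cells in the post above.
        List<double[]> cells =
                cellsFromRulings(new double[] {200, 400}, new double[] {100, 200, 400});
        for (double[] c : cells) {
            System.out.printf("<cell bottomLeft=\"%.0f,%.0f\" topRight=\"%.0f,%.0f\"/>%n",
                    c[0], c[1], c[2], c[3]);
        }
    }
}
```

A real document would also need tolerance when comparing coordinates and handling for partial rulings; the sketch leaves both out.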

The problem I face now is this:
I see many PDF documents where the tables are drawn in varying coordinate
systems. Sometimes they use the standard coordinate system (origin 0,0 at
the bottom left, maxX,maxY at the top right), but sometimes the origin (0,0)
is moved before each cell is drawn, or a different coordinate system is used
altogether.

I am studying the coordinate systems discussed in the PDF Reference 1.7, but
would love to know whether anyone has come across a similar requirement or
problem.
Down the line, I would like to contribute the code to PDFBox.

Thanks!
Pranay

Re: deducing table cells in a PDF document

Posted by Pranay Pramod <pr...@gmail.com>.

Hello Andreas,

Thanks again for the pointer. I have been able to resolve the issue.
I did not have to do much, as
PDGraphicsState.getCurrentTransformationMatrix()
is already there. All I needed was to apply the current CTM to the points I
needed to process.
Thanks!
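
A minimal sketch of what applying the CTM to the points amounts to. It uses java.awt.geom.AffineTransform as a stand-in for the matrix that PDFBox returns, so it runs without PDFBox on the classpath. The class CtmPointSketch and its methods are hypothetical names, and the example CTM is an assumed value, not taken from a real document.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
class CtmPointSketch {

    /** Maps a user-space point through the CTM [a b c d e f]. */
    static Point2D toPageSpace(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    /** Returns {minX, minY, maxX, maxY} of an "x y w h re" rectangle after the CTM is applied. */
    static double[] rectangleBounds(AffineTransform ctm, double x, double y, double w, double h) {
        // Transform all four corners, because a rotated CTM can swap which corner is the minimum.
        double[] corners = {x, y, x + w, y, x + w, y + h, x, y + h};
        ctm.transform(corners, 0, corners, 0, 4);
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < corners.length; i += 2) {
            minX = Math.min(minX, corners[i]);
            maxX = Math.max(maxX, corners[i]);
            minY = Math.min(minY, corners[i + 1]);
            maxY = Math.max(maxY, corners[i + 1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Assumed CTM: the origin has been moved to (200, 100) before the cell is drawn.
        AffineTransform ctm = new AffineTransform(1, 0, 0, 1, 200, 100);
        // "0 0 200 100 re" in the shifted system is the cell (200,100)-(400,200) on the page.
        double[] b = rectangleBounds(ctm, 0, 0, 200, 100);
        System.out.printf("bottomLeft=%.0f,%.0f topRight=%.0f,%.0f%n", b[0], b[1], b[2], b[3]);
    }
}
```

With PDFBox, the six CTM values would come from the matrix returned by getCurrentTransformationMatrix() at the moment the path operator is processed.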

Pranay




-- 

Regards,
Pranay

Re: deducing table cells in a PDF document

Posted by Pranay Pramod <pr...@gmail.com>.

Thanks, Andreas! I will follow up on your suggestion.



-- 

Regards,
Pranay

Re: deducing table cells in a PDF document

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

Pranay Pramod wrote:
> Thanks, Andreas, for your interest.
> I am trying to extract text, including the table information, from PDF
> documents.
> PDFBox currently extracts only plain text.
> 
> Using the graphics operators moveTo (m), lineTo (l) and rectangle (re), I am
> able to deduce the lines that form a table on a PDF page. My algorithm can
> then make out the individual cells of the table. My code assumes that the
> standard coordinate system is used. Whenever I encounter a different
> coordinate system, or a different way of drawing the table lines (shifting
> the origin for every line drawn?), my code breaks, for obvious reasons.
> 
> The PDF Reference 1.7 hints at pre-processing the CTM or the graphics state
> so that my module receives standard coordinates.
Yes, that's the point. You have to look at both the CTM and the
graphics state (see chapter 4.3 of the PDF Reference 1.7).

The CTM is used to scale, rotate and shift the coordinates. It is a
little too complex to describe the whole thing in two sentences.
Have a look at how PDGraphicsState.getCurrentTransformationMatrix()
is used, especially in PageDrawer.transformedPoint().
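
To make the scale/rotate/shift point concrete, here is a minimal sketch of how a cm operator with operands a b c d e f is concatenated onto the running CTM. It uses java.awt.geom.AffineTransform in place of the PDFBox matrix class. CmConcatSketch and applyCm are hypothetical names, and the operand values are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
class CmConcatSketch {

    /** Returns the CTM that results from applying "a b c d e f cm" to the given CTM. */
    static AffineTransform applyCm(AffineTransform ctm,
                                   double a, double b, double c, double d, double e, double f) {
        AffineTransform result = new AffineTransform(ctm);
        // The PDF Reference defines the new CTM as M x CTM for row vectors.
        // In AffineTransform terms that is concatenate(): the new matrix M is
        // applied to a point first, then the previous CTM.
        result.concatenate(new AffineTransform(a, b, c, d, e, f));
        return result;
    }

    public static void main(String[] args) {
        AffineTransform ctm = new AffineTransform();    // identity: default user space
        ctm = applyCm(ctm, 1, 0, 0, 1, 100, 50);        // "1 0 0 1 100 50 cm": shift the origin
        ctm = applyCm(ctm, 2, 0, 0, 2, 0, 0);           // "2 0 0 2 0 0 cm": scale by 2
        // A point drawn at (10, 10) is scaled first, then shifted.
        Point2D p = ctm.transform(new Point2D.Double(10, 10), null);
        System.out.println(p.getX() + "," + p.getY());
    }
}
```

The order matters: swapping the two cm operators would shift by (200, 100) instead, because the shift would then happen inside the scaled system.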

As for the graphics state, the stack is what matters. The state can be
saved to that stack and restored from it, so you have to implement that
behaviour as well; otherwise the graphics states will get mixed up.
In PDFBox, the PDFStreamEngine holds this stack.
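
The save/restore behaviour corresponds to the q and Q operators. A minimal sketch of such a stack follows, assuming only the CTM needs tracking (a real graphics state holds much more, such as colours and line width). GraphicsStateStackSketch is a hypothetical class and does not reproduce how PDFStreamEngine implements it.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of PDFBox: tracks only the CTM part of the graphics state.
class GraphicsStateStackSketch {

    private final Deque<AffineTransform> saved = new ArrayDeque<>();
    private AffineTransform ctm = new AffineTransform();

    /** "q": push a copy of the current state so later changes cannot leak into it. */
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    /** "Q": restore the most recently saved state; ignored if nothing was saved. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /** "a b c d e f cm": concatenate a matrix onto the current CTM. */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Returns a copy of the current CTM. */
    AffineTransform currentCtm() {
        return new AffineTransform(ctm);
    }

    public static void main(String[] args) {
        GraphicsStateStackSketch gs = new GraphicsStateStackSketch();
        gs.save();                              // q
        gs.concatenate(1, 0, 0, 1, 200, 100);   // origin moved for one cell
        System.out.println("inside q/Q: " + gs.currentCtm());
        gs.restore();                           // Q
        System.out.println("after Q:    " + gs.currentCtm());
    }
}
```

Without the restore step, the shift applied for one cell would still be in effect when the next cell is drawn, which is the kind of mix-up described above.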

HTH
Andreas Lehmkühler

Re: deducing table cells in a PDF document

Posted by Pranay Pramod <pr...@gmail.com>.

Thanks, Andreas, for your interest.
I am trying to extract text, including the table information, from PDF
documents.
PDFBox currently extracts only plain text.

Using the graphics operators moveTo (m), lineTo (l) and rectangle (re), I am
able to deduce the lines that form a table on a PDF page. My algorithm can
then make out the individual cells of the table. My code assumes that the
standard coordinate system is used. Whenever I encounter a different
coordinate system, or a different way of drawing the table lines (shifting
the origin for every line drawn?), my code breaks, for obvious reasons.

The PDF Reference 1.7 hints at pre-processing the CTM or the graphics state
so that my module receives standard coordinates.

Thanks,
Pranay




-- 

Regards,
Pranay

Re: deducing table cells in a PDF document

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,


Pranay Pramod wrote:
> I have a requirement to extract, along with the normal text, the text inside
> table cells together with its bounding-box information.
> I need to map the text to its table cell coordinates, something like
> this:
> 
> <table>
> <cell bottomLeft="200,200" topRight="400,400"> some text </cell>
> <cell bottomLeft="200,100" topRight="400,200"> some more text </cell>
> </table>
> 
> By implementing the graphics operators moveTo (m), lineTo (l) and rectangle
> (re), plus my own cell-deduction algorithm that works on the resulting set of
> points and lines, I have been able to get most of what I wanted.
> 
> The problem I face now is this:
> I see many PDF documents where the tables are drawn in varying coordinate
> systems. Sometimes they use the standard coordinate system (origin 0,0 at
> the bottom left, maxX,maxY at the top right), but sometimes the origin (0,0)
> is moved before each cell is drawn, or a different coordinate system is used
> altogether.
> 
> I am studying the coordinate systems discussed in the PDF Reference 1.7, but
> would love to know whether anyone has come across a similar requirement or
> problem.
> Down the line, I would like to contribute the code to PDFBox.
Sorry for asking, but I'm a little confused; perhaps I missed the
point.
What are you talking about: creating a new PDF with tables, or extracting
the text including the table information?

BR
Andreas Lehmkühler