Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2009/10/01 07:48:22 UTC

Re: deducing table cells in a PDF document

Hi,

Pranay Pramod wrote:
> Thanks, Andreas, for your interest.
> I am trying to extract text, including table information, from PDF
> documents.
> PDFBox currently extracts only plain text.
> 
> Using the graphics operators moveTo (m), lineTo (l) and rectangle (re), I am
> able to deduce the lines forming the table on a PDF page. From those lines
> my algorithm can work out the individual cells of the table. My code assumes
> the standard coordinate system. Whenever I encounter a different
> coordinate system or a different way of rendering the table lines
> (shifting the origin for every line drawn?), my code breaks, for the obvious
> reason.
> 
> The pdf-reference1.7 hints at pre-processing the CTM or the graphics state to
> feed standard coordinates to my module.
Yes, that's the point. You have to look at both the CTM and the
graphics state (see chapter 4.3 of the PDF 1.7 reference).

The CTM (current transformation matrix) is used to scale, rotate and
shift the coordinates. It is a bit too complex to describe the whole
thing in two sentences. Have a look at the usage of
PDGraphicsState.getCurrentTransformationMatrix(), especially in
PageDrawer.transformedPoint().
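
To illustrate what the CTM does to a single point, here is a minimal
sketch. It is not the PDFBox code: it uses java.awt.geom.AffineTransform
from the JDK as a stand-in, and the class and method names
(CtmPointDemo, toPageSpace) are made up for this example. It assumes the
CTM is given as the six numbers [a b c d e f] used by the PDF "cm"
operator, so that x' = a*x + c*y + e and y' = b*x + d*y + f.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
final class CtmPointDemo {

    /** Maps (x, y) through a CTM given as PDF's six numbers [a b c d e f]. */
    static Point2D.Double toPageSpace(double[] ctm, double x, double y) {
        // AffineTransform takes (m00, m10, m01, m11, m02, m12),
        // which is the same order as PDF's a b c d e f.
        AffineTransform at = new AffineTransform(
                ctm[0], ctm[1], ctm[2], ctm[3], ctm[4], ctm[5]);
        Point2D.Double out = new Point2D.Double();
        at.transform(new Point2D.Double(x, y), out);
        return out;
    }

    public static void main(String[] args) {
        // Scale by 2, then shift the origin by (100, 50).
        double[] ctm = {2, 0, 0, 2, 100, 50};
        Point2D.Double p = toPageSpace(ctm, 10, 20);
        System.out.println(p.x + ", " + p.y); // prints "120.0, 90.0"
    }
}
```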

As for the graphics state, the stack is what matters. The current state
can be saved to that stack and restored from it later, so you have to
implement that behaviour as well; otherwise the graphics states will
get mixed up. In PDFBox the PDFStreamEngine holds this stack.
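
The save/restore behaviour can be sketched roughly like this. It is
only an illustration under some assumptions, not what PDFStreamEngine
actually does: the class GraphicsStateStackDemo and its methods are
invented here, only the CTM part of the graphics state is tracked, and
java.awt.geom.AffineTransform again stands in for the PDFBox matrix
class. The methods correspond to the PDF operators q (save), Q (restore)
and cm (concatenate a matrix onto the CTM).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical tracker for the CTM part of the graphics state.
final class GraphicsStateStackDemo {
    private AffineTransform ctm = new AffineTransform(); // starts as identity
    private final Deque<AffineTransform> saved = new ArrayDeque<>();

    /** "q": push a copy of the current CTM. */
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    /** "Q": pop the most recently saved CTM; ignored if nothing was saved. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /** "cm": the new matrix is applied to points first, then the old CTM. */
    void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Maps a point through the current CTM. */
    Point2D.Double transform(double x, double y) {
        Point2D.Double out = new Point2D.Double();
        ctm.transform(new Point2D.Double(x, y), out);
        return out;
    }

    public static void main(String[] args) {
        GraphicsStateStackDemo g = new GraphicsStateStackDemo();
        g.concat(1, 0, 0, 1, 100, 200);   // shift the origin
        g.save();                         // q
        g.concat(2, 0, 0, 2, 0, 0);       // scale, only until the restore
        Point2D.Double inside = g.transform(5, 5);
        g.restore();                      // Q
        Point2D.Double outside = g.transform(5, 5);
        System.out.println(inside.x + ", " + inside.y);   // prints "110.0, 210.0"
        System.out.println(outside.x + ", " + outside.y); // prints "105.0, 205.0"
    }
}
```

Without the restore, the scale would leak into every line drawn
afterwards, which is the kind of mix-up described above.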

HTH
Andreas Lehmkühler



Re: deducing table cells in a PDF document

Posted by Pranay Pramod <pr...@gmail.com>.
Hello Andreas,

Thanks again for the pointer. I have been able to address the issue.
I did not have to do much, as
PDGraphicsState.getCurrentTransformationMatrix()
is already there. All I needed to do was apply the current CTM to the
points I needed to process.
Thanks!
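
For the archive, a rough sketch of the idea for the rectangle operator
(re). This is not my actual module: the class and method names are
placeholders, and java.awt.geom.AffineTransform is used in place of the
matrix returned by PDFBox so that the snippet runs on its own. The four
corners are transformed individually because a rotated CTM no longer
yields an axis-aligned rectangle.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Placeholder names; only the JDK classes are real.
final class RectangleCornersDemo {

    /** Applies the CTM to the four corners of an "x y w h re" rectangle. */
    static Point2D[] transformedCorners(AffineTransform ctm,
                                        double x, double y, double w, double h) {
        Point2D[] corners = {
            new Point2D.Double(x, y),
            new Point2D.Double(x + w, y),
            new Point2D.Double(x + w, y + h),
            new Point2D.Double(x, y + h)
        };
        Point2D[] out = new Point2D[corners.length];
        // Null entries in the destination array are allocated by transform().
        ctm.transform(corners, 0, out, 0, corners.length);
        return out;
    }

    public static void main(String[] args) {
        // 90 degree counter-clockwise rotation: (x, y) -> (-y, x).
        AffineTransform ctm = new AffineTransform(0, 1, -1, 0, 0, 0);
        for (Point2D p : transformedCorners(ctm, 10, 20, 30, 40)) {
            System.out.println(p.getX() + ", " + p.getY());
        }
    }
}
```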

Pranay




-- 

Regards,
Pranay

Re: deducing table cells in a PDF document

Posted by Pranay Pramod <pr...@gmail.com>.
Thanks Andreas! I would follow up on your suggestion.



-- 

Regards,
Pranay