Posted to users@pdfbox.apache.org by Gilad Denneboom <gi...@gmail.com> on 2017/05/22 20:07:12 UTC

Help identifying hair-lines in PDFs using PDFBox and tabula

Hi all,

So I'm trying to identify hair-lines in my PDFs. I came across tabula,
which seems to be able to do it, but I can't get it to quite work with my
files in the way I need it to, so any help is greatly appreciated!

Here's what I've been doing so far: I used the Ruling object from tabula to
extract both the horizontal and vertical rules from a stripped version of
the PDF page (i.e., after removing all the text in it).
I'm getting results, but now I want to relate them back to the original PDF
page, and that's proving difficult. If I add a text field using the
coordinates of the Ruling objects, it ends up way off from where I would
expect it to be. I think it has to do with the DPI setting used to
convert the PDF page to an image, which is necessary for the rulings
extraction.
So my question is: How can I take these Ruling objects and convert them
back to the original coordinates of the PDF?
I would also like to be able to only identify lines of a certain width and
height, but if I get the rectangles to work correctly I think I can do that
in post-processing.
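
For the post-processing step, a minimal sketch could look like the one below.
It assumes each detected rule has already been reduced to a
java.awt.geom.Rectangle2D bounding box in PDF points. The class name
HairlineFilter and the two thresholds are hypothetical and belong to neither
tabula nor PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of tabula or PDFBox.
final class HairlineFilter {
    private HairlineFilter() {}

    // A rule counts as a hairline when its short side is at most maxThickness
    // and its long side is at least minLength (same unit as the rectangle).
    static boolean isHairline(Rectangle2D r, double maxThickness, double minLength) {
        double thickness = Math.min(r.getWidth(), r.getHeight());
        double length = Math.max(r.getWidth(), r.getHeight());
        return thickness <= maxThickness && length >= minLength;
    }

    // Returns only the rules that pass isHairline, preserving input order.
    static List<Rectangle2D> keepHairlines(List<? extends Rectangle2D> rules,
                                           double maxThickness, double minLength) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D r : rules) {
            if (isHairline(r, maxThickness, minLength)) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

The thresholds only make sense once the rectangles are in PDF points, so this
filter would run after any coordinate conversion.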

Thanks in advance!
Gilad

Re: Help identifying hair-lines in PDFs using PDFBox and tabula

Posted by Gilad Denneboom <gi...@gmail.com>.
I've found that if I set the dpi to 72 the locations of the Rulings match
the original PDF page.
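
This fits with default PDF user space being 72 units per inch: at 72 dpi one
image pixel is one PDF point, so no scaling is needed. For other DPI values,
a rough sketch of the conversion might look like this. The class
PixelToPdfPoints and its parameters are hypothetical, and whether the y axis
has to be flipped depends on the coordinate origin the rulings use, which is
an assumption here. Page rotation and a crop box with a non-zero origin are
ignored.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of tabula or PDFBox.
final class PixelToPdfPoints {
    private PixelToPdfPoints() {}

    // Converts a rectangle measured in image pixels (rendered at the given dpi)
    // to PDF points. If flipY is true, the input is assumed to have a top-left
    // origin and the result is expressed from the bottom-left of a page that is
    // pageHeightPt points tall.
    static Rectangle2D toPdfRect(Rectangle2D px, double dpi,
                                 double pageHeightPt, boolean flipY) {
        double scale = 72.0 / dpi; // points per pixel
        double x = px.getX() * scale;
        double y = px.getY() * scale;
        double w = px.getWidth() * scale;
        double h = px.getHeight() * scale;
        if (flipY) {
            y = pageHeightPt - (y + h);
        }
        return new Rectangle2D.Double(x, y, w, h);
    }
}
```

With dpi set to 72 and flipY set to false the method returns the input
unchanged, which is the case described above.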

On Tue, May 23, 2017 at 12:02 PM, Gilad Denneboom <gilad.denneboom@gmail.com> wrote:

> PS. I'm also happy to hear any ideas on how to achieve it using PDFBox on
> its own, without tabula...

Re: Help identifying hair-lines in PDFs using PDFBox and tabula

Posted by Gilad Denneboom <gi...@gmail.com>.
PS. I'm also happy to hear any ideas on how to achieve it using PDFBox on
its own, without tabula...

On Tue, May 23, 2017 at 12:01 PM, Gilad Denneboom <gilad.denneboom@gmail.com> wrote:

> There doesn't seem to be one... I guess I can try StackOverflow.

Re: Help identifying hair-lines in PDFs using PDFBox and tabula

Posted by Gilad Denneboom <gi...@gmail.com>.
There doesn't seem to be one... I guess I can try StackOverflow.

On Tue, May 23, 2017 at 11:54 AM, Andreas Lehmkühler <an...@lehmi.de>
wrote:

> Sounds like a question for the tabulapdf community ...
>
> Andreas

Re: Help identifying hair-lines in PDFs using PDFBox and tabula

Posted by Andreas Lehmkühler <an...@lehmi.de>.
> Gilad Denneboom <gi...@gmail.com> wrote on 22 May 2017 at 22:07:
> 
> 
> Hi all,
> 
> So I'm trying to identify hair-lines in my PDFs. I came across tabula,
> which seems to be able to do it, but I can't get it to quite work with my
> files in the way I need it to, so any help is greatly appreciated!
> 
> Here's what I've been doing so far: I used the Ruling object from tabula to
> extract both the horizontal and vertical rules from a stripped version of
> the PDF page (i.e., after removing all the text in it).
> I'm getting results, but now I want to relate them back to the original PDF
> page, and that's proving difficult. If I add a text field using the
> coordinates of the Ruling objects, it ends up way off from where I would
> expect it to be. I think it has to do with the DPI setting used to
> convert the PDF page to an image, which is necessary for the rulings
> extraction.
> So my question is: How can I take these Ruling objects and convert them
> back to the original coordinates of the PDF?
> I would also like to be able to only identify lines of a certain width and
> height, but if I get the rectangles to work correctly I think I can do that
> in post-processing.
Sounds like a question for the tabulapdf community ...

Andreas
> 
> Thanks in advance!
> Gilad
