Posted to users@pdfbox.apache.org by jorgeeflorez <jo...@gmail.com> on 2018/11/12 18:56:42 UTC

Text extraction example

Hi all,

First, I want to thank Tilman for his effort in getting the text from a page
regardless of its rotation
(https://issues.apache.org/jira/browse/PDFBOX-4371).

Second, I want to share with you a small application I created in C#. It
uses the iTextSharp library and a custom text extraction strategy to get the
text.

Application: here
<https://drive.google.com/file/d/1CmKvkib_ONTytwaoIrrmMdVyICXO1IPd/view?usp=sharing>
Class that processes the text: here
<https://drive.google.com/file/d/1u3VykdQR8Eh9ooRiqxc4q2_20w3lw8gw/view?usp=sharing>
Sample PDF files: here
<https://drive.google.com/file/d/1KdpQEIEbIl5ZETq33C2X8JVM5qfMXlDg/view?usp=sharing>

I have been trying to port the code to Java and make it work with PDFBox
objects, but so far I have not managed to.

Basically, the magic happens in the RenderText method (based on other code I
found on a web page I no longer remember :( ). It uses vectors (the origin is
the lower-left corner of the page) to decide things such as whether there is a
line break or whether whitespace must be inserted between glyphs.
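
To make the idea concrete, here is a minimal Java sketch of that kind of
vector check. It is not the code from the linked class and it uses neither
iTextSharp nor PDFBox: the class VectorTextJoiner, the Chunk type and the
tolerance values are hypothetical, chosen only for illustration. A chunk
starts a new line when its start point lies too far from the previous
chunk's baseline, and a space is inserted when the gap along the baseline
exceeds half a space width.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: joins text chunks by comparing baseline vectors. */
final class VectorTextJoiner {

    /** Assumed tolerance (page units) for "still on the same baseline". */
    static final double LINE_TOLERANCE = 1.0;

    /** A run of text plus the start and end points of its baseline. */
    static final class Chunk {
        final String text;
        final double startX, startY, endX, endY;
        final double spaceWidth; // width of a space in this font, assumed known

        Chunk(String text, double startX, double startY,
              double endX, double endY, double spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.startY = startY;
            this.endX = endX;
            this.endY = endY;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Concatenates chunks, adding '\n' or ' ' where the geometry asks for it. */
    static String join(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk cur : chunks) {
            if (prev != null) {
                if (!onSameLine(prev, cur)) {
                    out.append('\n');
                } else if (gapAlongBaseline(prev, cur) > prev.spaceWidth / 2.0
                        && out.charAt(out.length() - 1) != ' '
                        && !cur.text.startsWith(" ")) {
                    out.append(' ');
                }
            }
            out.append(cur.text);
            prev = cur;
        }
        return out.toString();
    }

    /** Perpendicular distance from cur's start to prev's baseline (cross product). */
    static boolean onSameLine(Chunk prev, Chunk cur) {
        double dx = prev.endX - prev.startX, dy = prev.endY - prev.startY;
        double vx = cur.startX - prev.startX, vy = cur.startY - prev.startY;
        double len = Math.hypot(dx, dy);
        double dist = len == 0 ? Math.hypot(vx, vy) : Math.abs(dx * vy - dy * vx) / len;
        return dist <= LINE_TOLERANCE;
    }

    /** Distance from prev's end to cur's start, measured along prev's direction. */
    static double gapAlongBaseline(Chunk prev, Chunk cur) {
        double dx = prev.endX - prev.startX, dy = prev.endY - prev.startY;
        double gx = cur.startX - prev.endX, gy = cur.startY - prev.endY;
        double len = Math.hypot(dx, dy);
        return len == 0 ? Math.hypot(gx, gy) : (dx * gx + dy * gy) / len;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Hello", 10, 700, 40, 700, 3),
                new Chunk("world", 43, 700, 70, 700, 3),
                new Chunk("Next", 10, 688, 30, 688, 3));
        System.out.println(join(chunks));
    }
}
```

Because only the direction of the previous baseline is used, the same check
works for text running in any direction, for example bottom to top on a
rotated page.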

I just hope this code gives you some ideas for adjusting or improving (if you
consider it necessary) text extraction.

That's it.

Thank you.
Best Regards.

Jorge Eduardo Flórez

Re: Text extraction example

Posted by jorgeeflorez <jo...@gmail.com>.

Hi. OK, I understand. Never mind :)
Thanks.

On Mon, Nov 12, 2018 at 11:16 PM, Tilman Hausherr
<THausherr@t-online.de> wrote:

> Hi, thanks, but sorry, there are several reasons why I can't use it:
> 1) I don't know iText, 2) I can't use code "found in a web page I don't
> remember" (license!), 3) I don't run exe files.
>
> I think our TextStripper code is similar in that it uses some algorithms to
> decide where to insert blanks, and whether glyphs are on a line or not.
>
> Tilman

Re: Text extraction example

Posted by Tilman Hausherr <TH...@t-online.de>.

On 12.11.2018 at 19:56, jorgeeflorez wrote:
> [...]
> Second, I want to share with you a small application I created in C#. It
> uses the iTextSharp library and a custom text extraction strategy to get the
> text.
> [...]
> Basically, the magic happens in the RenderText method (based on other code I
> found on a web page I no longer remember :( ).
> [...]


Hi, thanks, but sorry, there are several reasons why I can't use it:
1) I don't know iText, 2) I can't use code "found in a web page I don't
remember" (license!), 3) I don't run exe files.

I think our TextStripper code is similar in that it uses some algorithms to
decide where to insert blanks, and whether glyphs are on a line or not.

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org