Posted to users@pdfbox.apache.org by Kay_Lee <he...@hotmail.com> on 2016/05/18 02:21:00 UTC

Hello, I have a question in extracting Texts from PDF file.

Hello,
 
I live in South Korea (East Asia) and I'm using Apache PDFBox to extract text from PDF files.
Name: Su-Sang, Lee (English name: Kay Lee)
Cell Phone: +82-10-3180-7976
Residence: Seoul, South Korea, Asia
E-mail: herurider@hotmail.com (or herurider@gmail.com)
 
My software development environment is:
 
Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache PDFBox library for .NET, available as a NuGet package).
 
I can extract text (in Korean, our language) from PDF files, with many thanks to the Apache Foundation.
 
However, my main concern is that PDFBox takes a little longer to extract text than iTextSharp and other competitors.
 
All I need is to extract Korean text from PDF files, nothing more.

I searched the internet, including Google and Stack Overflow, but found only a few related cases and no specific solution.

1) How can I extract text faster?
 
2) Do I need the whole library, which is more than 30 MB of files, if I only need to extract text?
If I only need certain DLL files out of all the PDFBox DLLs, could you please let me know which ones?

3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, such as 1.8.12 and 2.0.1.
 
I don't belong to any company or organization. I am a private individual developing software that will be distributed and used for free for 5 years, for the public benefit. As my major is not software-related but biochemistry, please bear with me and explain in as much detail as you can.

My simple code to extract text from a PDF file is:

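internal static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        // Open the PDF directly from its file path.
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        // Do not filter out duplicate overlapping text.
        stripper.setSuppressDuplicateOverlappingText(false);
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if extraction throws.
        if (doc != null)
        {
            doc.close();
        }
    }
}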
 
I hope for your kind and excellent support.

Thank you so much !

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
herurider@hotmail.com

RE: Hello, I have a question in extracting Texts from PDF file.

Posted by Kay_Lee <he...@hotmail.com>.

Dear Mr. Tilman Hausherr, 
 
Please accept my sincere apologies.

And thank you sincerely for your quick, excellent and delightful answer.

So far I have only gone through the Stack Overflow link, but I will check all the links you suggested.

My major is not in software but in biochemistry, and I'm finalizing the development of my application these days.
I therefore have to take care of everything from A to Z, a million matters, and I've been really busy. Please kindly understand.

I have not yet fully checked all of your links, but it doesn't make sense to me that I need so many DLL files just to extract text from a PDF.
(But I'm really satisfied with the quality of PDFBox.)

I hope you can also develop a 'nitro turbo' button as a library (.dll).

 
Again, my deepest appreciation to you.
 
All the best !
 
Yours truly,

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
herurider@hotmail.com

 
> Subject: Re: Hello, I have a question in extracting Texts from PDF file.
> To: users@pdfbox.apache.org
> From: THausherr@t-online.de
> Date: Wed, 18 May 2016 09:11:08 +0200

Re: Hello, I have a question in extracting Texts from PDF file.

Posted by Tilman Hausherr <TH...@t-online.de>.

On 18.05.2016 at 04:21, Kay_Lee wrote:
> 1) How can I extract text faster?

You can't, unless you have a "turbo" or "nitro" button on the computer.

Make sure you are opening the files as files and not as streams. But I
see below that you already do that, i.e. your code is good.
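
To illustrate the difference between the two ways of opening a document, here is a minimal C# sketch. It assumes the PDFBox 1.x .NET (IKVM) build from the question, where PDFTextStripper lives in org.apache.pdfbox.util and java.io is available through the IKVM assemblies. The names LoadVariants, ExtractFromFile and ExtractFromStream are made up for this example. The first method mirrors what the posted code already does; the second shows the stream-based variant that is better avoided when a file path is at hand.

```csharp
using org.apache.pdfbox.pdmodel;   // assumption: PDFBox 1.x .NET (IKVM) build
using org.apache.pdfbox.util;      // PDFTextStripper is in this namespace in 1.x

internal static class LoadVariants   // hypothetical name, for illustration only
{
    // File-based: PDFBox receives the path and opens the file itself.
    internal static string ExtractFromFile(string path)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            return new PDFTextStripper().getText(doc);
        }
        finally
        {
            if (doc != null) doc.close();
        }
    }

    // Stream-based: the caller opens a java.io.InputStream and hands it over.
    internal static string ExtractFromStream(string path)
    {
        java.io.InputStream input = new java.io.FileInputStream(path);
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(input);
            return new PDFTextStripper().getText(doc);
        }
        finally
        {
            if (doc != null) doc.close();
            input.close();
        }
    }
}
```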

> 2) Do I need the whole library, which is more than 30 MB of files, if I only need to extract text?

Of PDFBox itself, you need pdfbox, fontbox and logging (commons-logging).
If files are encrypted, then also Bouncy Castle. You won't need xmp and
the image libraries. See also here:
https://pdfbox.apache.org/1.8/dependencies.html

> If I only need certain DLL files out of all the PDFBox DLLs, could you please let me know which ones?
>
> 3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, such as 1.8.12 and 2.0.1.

Indeed. However, there is no official .NET release, i.e. none of the
"very active developers" is currently using one (an older release
is here: http://pdfbox.lehmi.de/ ). And I doubt the newer versions will
be faster. However, they will extract better.
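
Whether a different build, or a different way of opening the file, is actually faster can be checked by timing it. A minimal sketch using only System.Diagnostics.Stopwatch follows. ExtractionTimer and Measure are hypothetical names, and the lambda in Main is a stand-in workload so that the sketch runs without PDFBox.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

internal static class ExtractionTimer   // hypothetical helper, not part of PDFBox
{
    // Runs `work` once, returns its result and reports the wall-clock time taken.
    internal static T Measure<T>(Func<T> work, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        T result = work();
        watch.Stop();
        elapsed = watch.Elapsed;
        return result;
    }

    private static void Main()
    {
        TimeSpan elapsed;

        // Stand-in workload: sleeps briefly and returns a fixed string.
        // In the real application this lambda would call the extraction method.
        string text = Measure(() =>
        {
            Thread.Sleep(50);
            return "sample text";
        }, out elapsed);

        Console.WriteLine("Extracted {0} characters in {1:F0} ms",
                          text.Length, elapsed.TotalMilliseconds);
    }
}
```

In the application, the stand-in would be replaced by a call such as () => ExtractTextFromPdf(path), using the method from the original post. The first call in a process typically also pays one-time startup costs (class loading, JIT compilation), so timing a second document as well probably gives a fairer number.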

There is a guide from 2012 on creating the DLLs:
https://web.archive.org/web/20120204060917/http://pdfbox.apache.org/userguide/dot_net.html
but I don't know if it works.

See also this: http://www.squarepdf.net/pdfbox-in-net
https://stackoverflow.com/questions/8441991/how-to-build-pdfbox-for-net

>   
> I don't belong to any company or organization. I am a private individual developing software that will be distributed and used for free for 5 years, for the public benefit. As my major is not software-related but biochemistry, please bear with me and explain in as much detail as you can.

If you're non-profit and willing to distribute the source code, you can
use iText, see here: http://itextpdf.com/AGPL

>
> My simple code to extract text from a PDF file is:
>
> internal static string ExtractTextFromPdf(string path)
>          {
>              PDDocument doc = null;
>              try
>              {
>                  doc = PDDocument.load(path);
>                  PDFTextStripper stripper = new PDFTextStripper();
>                  stripper.setSuppressDuplicateOverlappingText(false);
>                  return stripper.getText(doc);
>              }
>              finally
>              {
>                  if (doc != null)
>                  {
>                      doc.close();
>                  }
>              }
>          }

Yes, that code is fine.

Tilman




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org