
Posted to users@pdfbox.apache.org by rahul bhalla <ur...@gmail.com> on 2013/04/16 11:35:17 UTC

Extract text using pdfbox

Hi,
I have searched various sites and read different forums, but I have not been
able to find a way to read a single line from a specific page number. I also
want to extract the properties of that line.
Is there any way to read a PDF using the readLine() method of BufferedReader,
or some other way?
Please advise.
-- 
Regards
Rahul Bhalla

Re: Extract text using pdfbox

Posted by rahul bhalla <ur...@gmail.com>.

Hello Vladimir,

I read the code, but unfortunately I am unable to understand it:

PDFStreamParser parser = new PDFStreamParser( contents.getStream() );
parser.parse();
List tokens = parser.getTokens();
for( int j=0; j<tokens.size(); j++ )
{
    Object next = tokens.get( j );
    if( next instanceof PDFOperator )
    {
        PDFOperator op = (PDFOperator)next;
        //Tj and TJ are the two operators that display
        //strings in a PDF
        if( op.getOperation().equals( "Tj" ) )
        {
            //Tj takes one operand, the string to display,
            //so update that operand (the token just before the operator)
            COSString previous = (COSString)tokens.get( j-1 );
            String string = previous.getString();
            string = string.replaceFirst( strToFind, message );
            previous.reset();
            previous.append( string.getBytes("ISO-8859-1") );
        }
        else if( op.getOperation().equals( "TJ" ) )
        {
            //TJ takes an array operand; update each string element in it
            COSArray previous = (COSArray)tokens.get( j-1 );
            for( int k=0; k<previous.size(); k++ )
            {
                Object arrElement = previous.getObject( k );
                if( arrElement instanceof COSString )
                {
                    COSString cosString = (COSString)arrElement;
                    String string = cosString.getString();
                    string = string.replaceFirst( strToFind, message );
                    cosString.reset();
                    cosString.append( string.getBytes("ISO-8859-1") );
                }
            }
        }
    }
}
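
To make the Tj/TJ distinction in this excerpt concrete: Tj has a single string operand, while TJ has an array operand that mixes string pieces with numeric position adjustments, so one visible word may be split over several strings. The sketch below models that with plain JDK collections instead of PDFBox's COSArray and COSString. The class name TjModel and the method visibleText are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class TjModel {
    /**
     * Joins the string pieces of a TJ-style operand array.
     * Numeric elements are skipped: they only adjust glyph position.
     */
    public static String visibleText(List<?> tjOperands) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjOperands) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy stand-in for a TJ array: strings mixed with position adjustments
        List<Object> operands = Arrays.<Object>asList("Hel", -120, "lo Wor", 35.5, "ld");
        System.out.println(visibleText(operands)); // prints "Hello World"
    }
}
```

Because the excerpt applies replaceFirst to each string piece separately, a search string that spans two pieces (for example "Hello" split as "Hel" and "lo Wor") would not be matched.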





On Tue, Apr 16, 2013 at 3:11 PM, Vladimir Starostenkov <
vladimir.starostenkov@gmail.com> wrote:

> Have you tried to look through
>
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java
>



-- 
Regards
Rahul Bhalla

Re: Extract text using pdfbox

Posted by Vladimir Starostenkov <vl...@gmail.com>.

Have you looked through the ReplaceString example?
http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java



Re: Extract text using pdfbox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Rahul,

PDF is a binary format, and text that appears as a single visible line can be stored in several separate pieces within the file. I think the easiest option for you is to start with the ExtractText command line tool and review its output: http://pdfbox.apache.org/commandlineutilities/ExtractText.html. Use the sort option to order the text by its position on the page.
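
As a rough sketch of the readLine() idea from the question: once the text of one page has been extracted to a String, a BufferedReader can walk it line by line. The block below uses only the JDK, so it runs without PDFBox. The class name PageLines and the method lineAt are made up for illustration. Obtaining the page text is assumed to happen elsewhere, for example with PDFBox's PDFTextStripper, which appears here only as a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class PageLines {
    // Assumption: pageText was produced elsewhere, e.g. with PDFBox:
    //   PDFTextStripper stripper = new PDFTextStripper();
    //   stripper.setSortByPosition(true);  // same idea as the sort option
    //   stripper.setStartPage(page);       // pages are 1-based
    //   stripper.setEndPage(page);
    //   String pageText = stripper.getText(document);

    /** Returns the given 1-based line of pageText, or null if it does not exist. */
    public static String lineAt(String pageText, int lineNumber) {
        if (lineNumber < 1) {
            return null;
        }
        try (BufferedReader reader = new BufferedReader(new StringReader(pageText))) {
            String line = null;
            for (int i = 0; i < lineNumber; i++) {
                line = reader.readLine();
                if (line == null) {
                    return null; // fewer lines than requested
                }
            }
            return line;
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String pageText = "Invoice 42\nTotal: 10 EUR\n";
        System.out.println(lineAt(pageText, 2)); // prints "Total: 10 EUR"
    }
}
```

A plain string carries no font or position details. In PDFBox those belong to the TextPosition objects that PDFTextStripper passes to processTextPosition, so extracting the properties of a line would need a subclass that records them.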

BR
Maruan Sahyoun
