
Posted to users@pdfbox.apache.org by Jack Bush <ne...@yahoo.com.au> on 2011/05/26 16:12:20 UTC

How to keep PDF format when extracting text

Hi All,

I have no problem extracting text from a PDF document using pdfbox-app-1.5.0.jar,
but the formatting is lost. I also downloaded fontbox-1.5.0.jar and
jempbox-1.5.0.jar, but I am not sure how to use them to make the format of the
extracted text file as close to the original PDF file as possible.

Is there any good documentation on this topic that uses recent jars? I found
some material through Google, but it either uses a much earlier version
(0.8) of PDFBox or the explanation is insufficient to follow. It is not in
the PDFBox FAQ.

Do you have an archived mailing list I could look up?

Many thanks,

Jack

Re: How to keep PDF format when extracting text

Posted by Jack Bush <ne...@yahoo.com.au>.

OK Elbin,

I will take up your suggestion.

Thank you once again,

Jack



----- Original Message ----
From: Elbin Elias <el...@gmail.com>
To: users@pdfbox.apache.org
Sent: Sun, 29 May, 2011 12:17:34 AM
Subject: Re: How to keep PDF format when extracting text

Hi Jack

Glad to hear that this works for you. For the delimiter issue, I would
suggest using a space as the delimiter instead of a pipe.

Thanks
Elbin



-- 
Thanks & Regards
Elbin K Elias

Re: How to keep PDF format when extracting text

Posted by Elbin Elias <el...@gmail.com>.

Hi Jack

Glad to hear that this works for you. For the delimiter issue, I would
suggest using a space as the delimiter instead of a pipe.

Thanks
Elbin

On Sat, May 28, 2011 at 4:10 PM, Jack Bush <ne...@yahoo.com.au> wrote:

> Hi Elbin,
>
> Excellent. Below is the code that has successfully converted only the
> required
> rows of PDF data to text:
>
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdDoc = new PDDocument(cosDoc);
>             pdfStripper = new PDFTextStripperByArea();
>             pdfStripper.setSortByPosition( true );
>             Rectangle rect = new Rectangle( 10, 250, 750, 550 );
>             pdfStripper.addRegion( "class1", rect );
>             List allPages = pdDoc.getDocumentCatalog().getAllPages();
>             PDPage firstPage = (PDPage)allPages.get( 0 );
>             pdfStripper.extractRegions( firstPage );
>             System.out.println( "Text in the area:" + rect );
>             System.out.println( pdfStripper.getTextForRegion( "class1" ) );
>
> However, I need to go a step further and split each row of data into
> pipe-delimited ('|') fields, so that each field holds the value (made up of
> words and spaces) of one column. Below is an example:
>
> Current data
> -----------------
> Suburb   Address            Type   Price       Result  Agent
> Fairyland 10 Rochester St 3 br h  $500,000    VB     My Real Estate Agent
>
> Desired outcome
> ----------------------
> Suburb   Address            Type   Price       Result  Agent
> Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent
>
> Is this possible? Or do I need to define additional layers of rectangles,
> one for each column? If so, any suggestions on how this could be achieved?
>
> We are nearly there.
>
> Many thanks again,
>
> Jack
>
>


-- 
Thanks & Regards
Elbin K Elias
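
A minimal sketch of the space-delimiter suggestion above, in plain Java. It assumes that a run of two or more whitespace characters marks a column gap, which is an assumption about the extracted text and not something PDFBox guarantees. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class PipeDelimiter {
    // Assumption: two or more whitespace characters in a row separate columns.
    private static final Pattern COLUMN_GAP = Pattern.compile("\\s{2,}");

    /** Replaces each column gap in one extracted row with a pipe. */
    static String toPipeDelimited(String row) {
        return String.join("|", COLUMN_GAP.split(row.trim()));
    }

    public static void main(String[] args) {
        String row = "Fairyland 10 Rochester St 3 br h  $500,000    VB     My Real Estate Agent";
        // prints: Fairyland 10 Rochester St 3 br h|$500,000|VB|My Real Estate Agent
        System.out.println(toPipeDelimited(row));
    }
}
```

On the sample row from this thread the first three columns stay merged, because only a single space separates them in the extracted text. Splitting on spacing alone therefore only works where the extraction leaves a wider gap between columns than between words.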

Re: How to keep PDF format when extracting text

Posted by Jack Bush <ne...@yahoo.com.au>.

Hi Elbin,

Excellent. Below is the code that has successfully converted only the required 
rows of PDF data to text:

            // parser is presumably a PDFParser already opened on the input PDF (setup not shown)
            parser.parse();
            cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper = new PDFTextStripperByArea();
            // order the extracted text by its position on the page
            pdfStripper.setSortByPosition( true );
            // region to extract: x, y, width, height
            Rectangle rect = new Rectangle( 10, 250, 750, 550 );
            pdfStripper.addRegion( "class1", rect );
            List allPages = pdDoc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage)allPages.get( 0 );
            pdfStripper.extractRegions( firstPage );
            System.out.println( "Text in the area:" + rect );
            System.out.println( pdfStripper.getTextForRegion( "class1" ) );

However, I need to go a step further and split each row of data into
pipe-delimited ('|') fields, so that each field holds the value (made up of
words and spaces) of one column. Below is an example:

Current data
-----------------
Suburb   Address            Type   Price       Result  Agent
Fairyland 10 Rochester St 3 br h  $500,000    VB     My Real Estate Agent

Desired outcome
----------------------
Suburb   Address            Type   Price       Result  Agent
Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent

Is this possible? Or do I need to define additional layers of rectangles,
one for each column? If so, any suggestions on how this could be achieved?

We are nearly there.

Many thanks again,

Jack


----- Original Message ----
From: Elbin Elias <el...@gmail.com>
To: users@pdfbox.apache.org
Sent: Fri, 27 May, 2011 11:54:48 PM
Subject: Re: How to keep PDF format when extracting text

pdfbox\examples\util



-- 
Thanks & Regards
Elbin K Elias
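
One way to approach the "one rectangle per column" idea raised in the question above is to decide the column from the horizontal position of each piece of text rather than from the spacing between words. The following is a minimal sketch in plain Java, not PDFBox code: the Fragment class, the column left edges and the x-coordinates are all hypothetical stand-ins for the positions a PDF library would report. With PDFBox itself, a comparable effect might be had by calling addRegion once per column on PDFTextStripperByArea, but that is not shown or tested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnSplitter {
    /** Hypothetical stand-in: a piece of text and the x-coordinate where it starts. */
    static final class Fragment {
        final double x;
        final String text;
        Fragment(double x, String text) { this.x = x; this.text = text; }
    }

    /**
     * Builds one pipe-delimited row.
     * columnStarts holds the left edge of each column in ascending order;
     * fragments are expected in reading order (left to right).
     */
    static String toPipeDelimited(List<Fragment> row, double[] columnStarts) {
        List<StringBuilder> cells = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            cells.add(new StringBuilder());
        }
        for (Fragment fragment : row) {
            int col = 0;
            // pick the last column whose left edge is at or before the fragment
            for (int i = 0; i < columnStarts.length; i++) {
                if (fragment.x >= columnStarts[i]) {
                    col = i;
                }
            }
            StringBuilder cell = cells.get(col);
            if (cell.length() > 0) {
                cell.append(' '); // words inside one column keep a single space
            }
            cell.append(fragment.text);
        }
        return String.join("|", cells);
    }

    public static void main(String[] args) {
        // made-up left edges for Suburb, Address, Type, Price, Result, Agent
        double[] columnStarts = {10, 90, 250, 320, 400, 450};
        List<Fragment> row = Arrays.asList(
            new Fragment(10, "Fairyland"),
            new Fragment(90, "10"), new Fragment(105, "Rochester"), new Fragment(160, "St"),
            new Fragment(250, "3"), new Fragment(260, "br"), new Fragment(275, "h"),
            new Fragment(320, "$500,000"),
            new Fragment(400, "VB"),
            new Fragment(450, "My"), new Fragment(470, "Real"),
            new Fragment(500, "Estate"), new Fragment(540, "Agent"));
        // prints: Fairyland|10 Rochester St|3 br h|$500,000|VB|My Real Estate Agent
        System.out.println(toPipeDelimited(row, columnStarts));
    }
}
```

The column edges have to be measured once for the report layout in question; if the layout changes between files, the edges would need to be detected rather than hard-coded.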

Re: How to keep PDF format when extracting text

Posted by Elbin Elias <el...@gmail.com>.

pdfbox\examples\util

On Fri, May 27, 2011 at 3:51 PM, Jack Bush <ne...@yahoo.com.au> wrote:

> Hi Elbin,
>
> Is it too much to ask if you could point me to where the sample code for
> this area is?
>
> Thanks a lot,
>
> Jack
>
>


-- 
Thanks & Regards
Elbin K Elias

Re: How to keep PDF format when extracting text

Posted by Jack Bush <ne...@yahoo.com.au>.

Hi Elbin,

Is it too much to ask if you could point me to where the sample code for this
area is?

Thanks a lot,

Jack


----- Original Message ----
From: Elbin Elias <el...@gmail.com>
To: users@pdfbox.apache.org
Sent: Fri, 27 May, 2011 11:18:28 PM
Subject: Re: How to keep PDF format when extracting text

Hi Jack

Try extractByArea instead of getText. There is also sample code explaining
the same

Regards
Elbin



-- 
Thanks & Regards
Elbin K Elias

Re: How to keep PDF format when extracting text

Posted by Elbin Elias <el...@gmail.com>.

Hi Jack

Try extractByArea instead of getText. There is also sample code explaining
the same

Regards
Elbin

On Fri, May 27, 2011 at 3:02 PM, Jack Bush <ne...@yahoo.com.au> wrote:

> Hi Eric,
>
> Thanks for responding back to my call for assistance.
>
> I am extracting text from a PDF file only. The rows of data have been moved
> around and the heading is down at the bottom of the rows of data, possibly
> from a table. The order of the page has also gone out of sync.
>
> Here is an example of the file that I am trying to extract from:
> http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf
>
> I am only interested in the stats in the middle of the page.
>
> Thanks again,
>
> Jack


-- 
Thanks & Regards
Elbin K Elias
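
The advice above comes down to two steps: keep only the text that falls inside a chosen rectangle, and order what is left by its position on the page. The sketch below illustrates those two steps in plain Java. It is not PDFBox code: the Fragment class and all coordinates are hypothetical, y is assumed to grow downward from the top of the page, and fragments on the same line are assumed to share exactly the same y value. In PDFBox, PDFTextStripperByArea together with setSortByPosition( true ) covers this job, as in the code Jack posted in this thread.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RegionText {
    /** Hypothetical stand-in for a positioned piece of text reported by a PDF library. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;
        Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /** Keeps fragments that start inside region, ordered top to bottom, then left to right. */
    static List<String> linesInRegion(List<Fragment> fragments, Rectangle2D region) {
        List<Fragment> kept = new ArrayList<>();
        for (Fragment fragment : fragments) {
            if (region.contains(fragment.x, fragment.y)) {
                kept.add(fragment);
            }
        }
        kept.sort(Comparator.comparingDouble((Fragment a) -> a.y)
                            .thenComparingDouble(a -> a.x));

        // join fragments that share a y value into one output line
        List<String> lines = new ArrayList<>();
        StringBuilder line = null;
        double currentY = 0;
        for (Fragment fragment : kept) {
            if (line == null || fragment.y != currentY) {
                if (line != null) {
                    lines.add(line.toString());
                }
                line = new StringBuilder(fragment.text);
                currentY = fragment.y;
            } else {
                line.append(' ').append(fragment.text);
            }
        }
        if (line != null) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // fragments deliberately listed out of reading order, heading last
        List<Fragment> page = Arrays.asList(
            new Fragment(100, 300, "10 Rochester St"),
            new Fragment(20, 300, "Fairyland"),
            new Fragment(20, 50, "Page title"),      // outside the region
            new Fragment(100, 280, "Address"),
            new Fragment(20, 280, "Suburb"));
        Rectangle2D region = new Rectangle2D.Double(10, 250, 750, 550);
        // prints "Suburb Address" and then "Fairyland 10 Rochester St"
        for (String line : linesInRegion(page, region)) {
            System.out.println(line);
        }
    }
}
```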

Re: How to keep PDF format when extracting text

Posted by Jack Bush <ne...@yahoo.com.au>.

Hi Eric,

Thanks for responding back to my call for assistance.

I am extracting text from a PDF file only. The rows of data have been moved
around and the heading is down at the bottom of the rows of data, possibly from a
table. The order of the page has also gone out of sync.

Here is an example of the file that I am trying to extract from:
http://www.homepriceguide.com.au/saturday_auction_results/Adelaide.pdf

I am only interested in the stats in the middle of the page.

Thanks again,

Jack
----- Original Message ----
From: Eric Douglas <ed...@blockhouse.com>
To: users@pdfbox.apache.org
Sent: Fri, 27 May, 2011 12:28:52 AM
Subject: RE: How to keep PDF format when extracting text

This sounds a bit vague.  "PDF format" sounds like you're creating a PDF, but your
description sounds more like you're getting text from a PDF and trying to make it
look like it does in the PDF.  Are you trying to modify a PDF, or are you just
losing font information on extracted text?
Is the font information embedded?
Do you have any samples of your text extraction code or a PDF you're extracting?



RE: How to keep PDF format when extracting text

Posted by Eric Douglas <ed...@blockhouse.com>.

This sounds a bit vague.  "PDF format" sounds like you're creating a PDF, but your description sounds more like you're getting text from a PDF and trying to make it look like it does in the PDF.  Are you trying to modify a PDF, or are you just losing font information on extracted text?
Is the font information embedded?
Do you have any samples of your text extraction code or a PDF you're extracting?
