Posted to users@pdfbox.apache.org by Mark Kerzner <ma...@gmail.com> on 2009/07/23 20:44:54 UTC

PDFBox to pdftohtml comparison?

Hi,
I have compared PDFBox text extraction with pdftohtml (on Linux) followed by
an HTML-to-text conversion, and I found the second a little clearer. For
example, the bottom lines of a PDF (copyright notices, etc.) were combined
into one line by PDFBox but came out as three separate pieces the other way.

I am using the latest stable PDFBox jar, which dates back to 2006, and the
pdftohtml utility is from about the same time, so I can understand this.

My question then is twofold: does the comparison make sense, and should I
use pdftohtml combined with a text converter, or should I try building the
latest PDFBox from SVN?

Thank you,
Mark

Re: PDFBox to pdftohtml comparison?

Posted by Mark Kerzner <ma...@gmail.com>.
Thanks, will do

On Thu, Jul 23, 2009 at 11:51 PM, Daniel Wilson <
williamstonconsulting@gmail.com> wrote:

> Iain is absolutely right.  The latest code in SVN has LOTS of improvements
> -- and is pretty stable in its own right.
> Daniel

Re: PDFBox to pdftohtml comparison?

Posted by Daniel Wilson <wi...@gmail.com>.
Iain is absolutely right.  The latest code in SVN has LOTS of improvements
-- and is pretty stable in its own right.
Daniel

On Thu, Jul 23, 2009 at 5:01 PM, Iain Clapham
<ia...@googlemail.com> wrote:

> Mark,
>
> Have you upgraded to the latest FontBox? It is now at 0.8.
>
> I think it is a good idea to pull the latest from SVN (then you can hack away
> at the nice Java code :~))
>
> Cheers +++ Iain

Re: PDFBox to pdftohtml comparison?

Posted by Iain Clapham <ia...@googlemail.com>.
Mark,

Have you upgraded to the latest FontBox? It is now at 0.8.

I think it is a good idea to pull the latest from SVN (then you can hack
away at the nice Java code :~))

Cheers +++ Iain
