Posted to users@pdfbox.apache.org by Shyam Sundar <sw...@gmail.com> on 2016/07/29 10:44:17 UTC

Spacing between lines not retained

Hi,

While converting a particular PDF to text, the spacing between lines and
paragraphs is not retained; the output is just flat text.

Sample file : ftp://PfXxyEhxh:h7hHhpOh7O@ftp.emc.com/Sample.zip

It looks like a file-specific issue. Could you please check?

Thanks.

Re: Spacing between lines not retained

Posted by Gregor Kovač <ko...@gmail.com>.
Hi!

Yes, if it behaves like that, it looks like a bug to me too, but one of the
developers would have to look into it.

Best regards,
    Kovi




-- 
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
|  In A World Without Fences Who Needs Gates?  |
|              Experience Linux.               |
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~

Re: Spacing between lines not retained

Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.07.2016 at 13:19, Shyam Sundar wrote:
> Thanks, Kovi, for the quick response.
>
> But why does it fail only for one particular file? A replica of the same
> file generated with another PDF library works perfectly fine with
> PDFTextStripper. Isn't that strange, and doesn't it look like a bug?
>
> I hope you checked the shared Sample.zip; it contains both the working and
> the non-working files.

The "working" file contains lines that consist of just one space; that is why.

That is what I would expect. If you want perfectly formatted text, why not
use the PDF itself? Text extraction is usually meant for searching.

You can also use the PrintTextLocations.java example, which shows the
coordinates of every character. The DrawPrintTextLocations example shows
that too, plus the visual location of the glyphs in an image rendering.
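
To illustrate how such coordinates could be put to use, here is a minimal
sketch that re-inserts blank lines wherever the vertical gap between two
consecutive text lines is unusually large. It is not PDFBox API: the
TextLine class, the rebuild method and the 1.5 gap factor are hypothetical
names and values chosen for illustration, and it assumes you have already
collected one y coordinate and height per text line, with y growing
downward.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSpacingSketch {

    /** Hypothetical holder for one extracted line: its text, y position and height. */
    public static final class TextLine {
        final String text;
        final double y;       // assumed to grow downward on the page
        final double height;  // approximate glyph height of the line

        public TextLine(String text, double y, double height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Joins the lines with '\n' and adds one extra '\n' (a blank line)
     * whenever the distance to the previous line exceeds
     * gapFactor times the previous line's height.
     */
    public static String rebuild(List<TextLine> lines, double gapFactor) {
        StringBuilder out = new StringBuilder();
        TextLine prev = null;
        for (TextLine line : lines) {
            if (prev != null) {
                out.append('\n');
                double gap = line.y - prev.y;
                if (gap > gapFactor * prev.height) {
                    out.append('\n'); // large gap: treat as paragraph break
                }
            }
            out.append(line.text);
            prev = line;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("First paragraph, line 1", 100, 12));
        lines.add(new TextLine("First paragraph, line 2", 114, 12));
        lines.add(new TextLine("Second paragraph", 150, 12));
        System.out.println(rebuild(lines, 1.5));
    }
}
```

The threshold is a guess; a real document would probably need it tuned per
font size, or derived from the most common line distance on the page.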

What you could also try is setParagraphStart("\n") and/or 
setParagraphEnd("\n").

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Spacing between lines not retained

Posted by Shyam Sundar <sw...@gmail.com>.
Thanks, Kovi, for the quick response.

But why does it fail only for one particular file? A replica of the same
file generated with another PDF library works perfectly fine with
PDFTextStripper. Isn't that strange, and doesn't it look like a bug?

I hope you checked the shared Sample.zip; it contains both the working and
the non-working files.

Regards.


Re: Spacing between lines not retained

Posted by Gregor Kovač <ko...@gmail.com>.
Hi!

The API docs for PDFTextStripper (
http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html)
state that "This class will take a pdf document and strip out all of the
text and ignore the formatting and such". Note that you can call
setAddMoreFormatting (
http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setAddMoreFormatting(boolean))
with true to get a bit more formatting, but in my experience the result
does not compare to "pdftotext -layout" from the Xpdf project, which does a
much better job of preserving layout.
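
For anyone who wants to try the pdftotext route from Java, a minimal sketch
could look like the following. It assumes the pdftotext binary is installed
and on the PATH; the class name, the method names and the file names are
made up for illustration, and only the "-layout" option comes from the
message above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PdfToTextLayout {

    /** Builds the command line: pdftotext -layout <pdf> <txt>. */
    public static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    /** Runs pdftotext and returns its exit code (0 normally means success). */
    public static int run(Path pdf, Path txt) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt))
                .inheritIO() // show pdftotext's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextLayout <input.pdf> <output.txt>");
            return;
        }
        int exitCode = run(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("pdftotext exit code: " + exitCode);
    }
}
```

Shelling out adds an external dependency, so this only makes sense if
layout fidelity matters more than staying inside a single Java library.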

Best regards,
    Kovi



