Posted to users@pdfbox.apache.org by "Jerkins, Devan" <DJ...@aegonusa.com> on 2010/03/09 14:59:28 UTC

RE: PDFTextStripper parsing problem IBM Linux -Dibm.stream.nio

Below is a more complete answer from IBM about the -Dibm.stream.nio option:

-Dibm.stream.nio=[true | false]
This option addresses the ordering of IO and NIO converters. When this
option is set to true, the NIO converters are used instead of the IO
converters.

NIO stands for New IO. The NIO package was introduced in 1.4.0 in
order to overcome some of the shortcomings of the IO package.

By default, IBM Java uses the IO converters because they perform
better. Customers may use the NIO converters by setting the option
-Dibm.stream.nio=true.
USAGE:
java -Dibm.stream.nio=true <app-name>
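
To confirm that the flag actually reached the application (for example
when the JVM arguments are set through an application server console),
a small check like the following may help. This is only a sketch: the
class name StreamNioCheck and the method isNioRequested are made up for
illustration, and the check only reports the value of the system
property. It cannot tell which converters the JVM is really using, and
the strict comparison against "true" is an assumption about how the
value is interpreted.

```java
// Hypothetical helper; class and method names are illustrative only.
class StreamNioCheck {

    // Assumption: only the exact string "true" counts as enabling the option.
    static boolean isNioRequested(String propertyValue) {
        return "true".equals(propertyValue);
    }

    public static void main(String[] args) {
        // null when -Dibm.stream.nio was not passed on the command line
        String value = System.getProperty("ibm.stream.nio");

        System.out.println("java.vendor    = " + System.getProperty("java.vendor"));
        System.out.println("ibm.stream.nio = " + value);
        System.out.println("NIO converters requested: " + isNioRequested(value));
    }
}
```

Run it with the same JVM arguments as the real application, e.g.
java -Dibm.stream.nio=true StreamNioCheck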

The reason that the SUN and IBM JDKs differ in their behavior lies in
which converter each JVM defaults to. IBM defaults to the IO
converters, which throw exceptions on errors, whereas SUN defaults to
the NIO converters, which do not throw exceptions. SUN made this change
from 1.4.1 onwards, and we chose not to adopt it because, performance
wise, the IO converters are better.
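
The difference between a converter that throws on bad input and one
that silently carries on can be illustrated with the public
java.nio.charset API. This is not the JVM-internal converter code the
IBM answer refers to, only an analogy using standard classes; the class
DecoderErrorDemo and its methods are hypothetical names chosen for this
sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; an analogy, not IBM's or Sun's internal converters.
class DecoderErrorDemo {

    // Decodes bytes as UTF-8, applying the given action to bad input.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Returns the decoded text, or a short note naming the exception thrown.
    static String describe(byte[] bytes, CodingErrorAction action) {
        try {
            return decode(bytes, action);
        } catch (CharacterCodingException e) {
            return "exception: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'1', (byte) 0xFF, '2'};

        // REPORT: the strict behavior, an exception is raised on bad input.
        System.out.println("strict : " + describe(bad, CodingErrorAction.REPORT));

        // REPLACE: the lenient behavior, bad input becomes U+FFFD.
        System.out.println("lenient: " + describe(bad, CodingErrorAction.REPLACE));
    }
}
```

The strict call reports a MalformedInputException, while the lenient
call returns the text with the bad byte replaced by U+FFFD.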

The issue you are experiencing is actually an IBM VM limitation and the
result of a compromise between functionality and performance.

You can refer to sdkandruntimeguide.win32.en.htm from the IBM Java 5
SDK for details on this JVM option.

Also note that many customers have reported similar issues in the past,
and we suggested the same workaround.


-----Original Message-----
From: Jerkins, Devan
Sent: Thursday, February 25, 2010 7:56 AM
To: users@pdfbox.apache.org
Subject: RE: PDFTextStripper parsing problem IBM Linux

It sounds like the known issue, but I do see a difference. The PDF that I'm using can be read correctly on Linux when it isn't running in WAS, and it can be read correctly when it is running on WAS in a Windows environment. The problem seems to be with the IBM JVM environment on Linux. I'm planning on asking IBM about it; if I get an answer or find a workaround I'll post back. If anyone has any ideas on how to solve it, let me know.

Many thanks,

Devan J

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
Sent: Thursday, February 25, 2010 12:40 AM
To: users@pdfbox.apache.org
Subject: Re: PDFTextStripper parsing problem IBM Linux

Hi,

Jerkins, Devan wrote:
> I'm having trouble getting the PDFTextStripper to correctly translate
> non-word characters. It reads "1" and passes back "one", and reads " "
> and passes back "space". Has anyone seen this before and know how to
> fix it? This only happens when I run my code in IBM WAS on Linux; if I
> run it on IBM WAS on Windows it works fine (i.e. "1" returns "1"). The
> only way I was able to get it to work on Linux was to try a PDF that
> had embedded fonts.
Sounds like an already known issue [1]

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-595

