
Posted to users@pdfbox.apache.org by "Erik Scholtz, ArgonSoft GmbH" <es...@argonsoft.de> on 2010/02/15 13:48:22 UTC

Re: logging

Jason,

https://issues.apache.org/jira/browse/PDFBOX-581 describes this problem. 
In the upcoming release 1.0.0 (will be here within hours I think ;) the 
problem will be fixed.

Greetings,
Erik

jason franklin-stokes wrote:
> this is what I am getting through stdout on the terminal...
> 
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: i
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: re
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: W
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: n
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: cs
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: scn
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: f
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: CS
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: SCN
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: J
> Jan 27, 2010 1:47:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: j
> Jan
> 
> On Jan 27, 2010, at 6:45 PM, Erik Scholtz, ArgonSoft GmbH wrote:
> 
>> I have also noticed the INFO messages on standard output when using 0.8.0-incubating. For me it does not matter, but it is a problem when STDOUT is used to return the results.
>>
>> Cheers,
>> Erik
>>
>> Daniel Wilson wrote:
>>> Which version of PDFBox?  If you're using the released version of
>>> 0.8.0-incubating, I do not believe this should be happening.
>>> Daniel
>>> On Wed, Jan 27, 2010 at 11:34 AM, jason franklin-stokes <
>>> jasoninclass@googlemail.com> wrote:
>>>> dear all,
>>>>
>>>> I am using pdfbox (for pdf text extract) from within jruby (i.e. not via
>>>> command line). I am getting INFO: messages via standard output to my
>>>> terminal, which of course I want to disable. What do I need to do to get rid
>>>> of logging to standard output?
>>>>
>>>> any help would be very much appreciated.
>>>>
>>>> thanks a million
>>>>
>>>> jason.
>
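
For anyone who cannot upgrade yet, here is a minimal sketch of one possible workaround. It assumes the INFO lines are emitted through java.util.logging (the timestamp format in the quoted output looks like that framework's default formatter, but the thread does not confirm it). The class name QuietPdfBoxLogging and the method raiseThreshold are made up for this illustration and are not part of PDFBox.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost after GC.
    private static Logger pinnedLogger;

    /** Raises the threshold of the given logger namespace and returns the logger. */
    public static Logger raiseThreshold(String namespace, Level level) {
        Logger logger = Logger.getLogger(namespace);
        logger.setLevel(level);
        pinnedLogger = logger;
        return logger;
    }

    public static void main(String[] args) {
        // Assumption: PDFBox logs under names starting with "org.apache.pdfbox",
        // as the class name in the quoted output suggests.
        raiseThreshold("org.apache.pdfbox", Level.WARNING);

        // A child logger without its own level inherits the parent's threshold.
        Logger child = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        child.info("unsupported/disabled operation: re"); // filtered out
        child.warning("warnings still get through");       // still reported
    }
}
```

From JRuby the same two calls (Logger.getLogger and setLevel) should be reachable through the Java integration before the first PDFBox call is made. If the messages turn out to come from a different logging backend, this sketch will have no effect.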

Re: PDFbox text extraction: l for i

Posted by Villu Ruusmann <vi...@gmail.com>.

Hello there,

>
> I know about ligatures, and normally PDFBox handles them well, e.g. ﬀ ﬃ ﬁ ﬂ are quite common in TeX-produced PDF documents.
> But why should PDFBox reproduce a fi (FI) ligature as fl (FL)?
>

When does this problem occur? Are you receiving "fl" instead of "fi"
when performing text extraction (e.g. with the PDFTextStripper utility)
or are you seeing it when performing PDF rendering (e.g. with the
PageDrawer utility)?

Debugging could be more or less rewarding depending on what tools you
are using and how familiar you are with font encodings and charsets.
The basic idea would be to find out the value of the "problematic"
byte in the PDF text object, and then to look up its character name.
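
To illustrate the lookup step, here is a small sketch. It assumes the font uses TeX's standard OT1 text encoding, where the f-ligatures occupy codes 0x0B to 0x0F; the actual PDF may use a different or customized encoding, so treat the table as an example only. The class LigatureLookup and its methods are hypothetical helpers, not PDFBox API.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class LigatureLookup {
    // Ligature slots of TeX's OT1 encoding (assumed, see above):
    // byte code -> { glyph name, Unicode ligature character }.
    private static final Map<Integer, String[]> OT1_LIGATURES = new LinkedHashMap<>();
    static {
        OT1_LIGATURES.put(0x0B, new String[] {"ff",  "\uFB00"});
        OT1_LIGATURES.put(0x0C, new String[] {"fi",  "\uFB01"});
        OT1_LIGATURES.put(0x0D, new String[] {"fl",  "\uFB02"});
        OT1_LIGATURES.put(0x0E, new String[] {"ffi", "\uFB03"});
        OT1_LIGATURES.put(0x0F, new String[] {"ffl", "\uFB04"});
    }

    /** Returns the glyph name for a byte code, or null if it is not a ligature slot. */
    public static String glyphName(int code) {
        String[] entry = OT1_LIGATURES.get(code);
        return entry == null ? null : entry[0];
    }

    /** Returns the ligature expanded to plain letters via NFKC, or null. */
    public static String plainText(int code) {
        String[] entry = OT1_LIGATURES.get(code);
        return entry == null ? null : Normalizer.normalize(entry[1], Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        for (int code : OT1_LIGATURES.keySet()) {
            System.out.println(String.format("0x%02X", code)
                    + " -> " + glyphName(code) + " / " + plainText(code));
        }
    }
}
```

Under that assumption, fi (0x0C) and fl (0x0D) are adjacent codes, so it is worth checking whether the byte in the text object or the mapping applied to it is off by one.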

If you could share the PDF document I might take a look at it
sometime. Could be another Type1C font issue where I am to blame.


VR

Re: PDFbox text extraction: l for i

Posted by Thomas Fischer <fi...@aon.at>.

Hi Villu,

I know about ligatures, and normally PDFBox handles them well, e.g. ﬀ ﬃ ﬁ ﬂ are quite common in TeX-produced PDF documents.
But why should PDFBox reproduce a fi (FI) ligature as fl (FL)?

Puzzled
Thomas Fischer


Am 15.02.2010 um 23:20 schrieb Villu Ruusmann:

> Hello there,
> 
> Check out "typographic ligatures", as in
> http://en.wikipedia.org/wiki/Typographic_ligature
> 
> 
> VR
> 
> On Mon, Feb 15, 2010 at 11:37 PM, Thomas Fischer <fi...@aon.at> wrote:
>> Hello,
>> 
>> I have a perfectly normal-looking PDF file (created using TeX and AFPL Ghostscript 6.50)
>> 
>> where
>> 
>> Case 2. Assume that Z is finite, but non-trivial.
>> 
>> comes out as
>> 
>> Case 2. Assume that Z is flnite, but non-trivial.
>> 
>> with an "l" (small L) in flnite.
>> 
>> Is there any explanation?
>> I have similar problems with documents that are retrodigitized (PDF with images and text layer), but this is genuine PDF, and I get the correct text if I select and copy from the file using Preview or Skim.
>> 
>> Thanks in advance for any hints
>> Thomas

Re: PDFbox text extraction: l for i

Posted by Villu Ruusmann <vi...@gmail.com>.

Hello there,

Check out "typographic ligatures", as in
http://en.wikipedia.org/wiki/Typographic_ligature


VR

On Mon, Feb 15, 2010 at 11:37 PM, Thomas Fischer <fi...@aon.at> wrote:
> Hello,
>
> I have a perfectly normal-looking PDF file (created using TeX and AFPL Ghostscript 6.50)
>
> where
>
> Case 2. Assume that Z is finite, but non-trivial.
>
> comes out as
>
> Case 2. Assume that Z is flnite, but non-trivial.
>
> with an "l" (small L) in flnite.
>
> Is there any explanation?
> I have similar problems with documents that are retrodigitized (PDF with images and text layer), but this is genuine PDF, and I get the correct text if I select and copy from the file using Preview or Skim.
>
> Thanks in advance for any hints
> Thomas

PDFbox text extraction: l for i

Posted by Thomas Fischer <fi...@aon.at>.

Hello,

I have a perfectly normal-looking PDF file (created using TeX and AFPL Ghostscript 6.50)

where 

Case 2. Assume that Z is finite, but non-trivial.

comes out as

Case 2. Assume that Z is flnite, but non-trivial.

with an "l" (small L) in flnite.

Is there any explanation?
I have similar problems with documents that are retrodigitized (PDF with images and text layer), but this is genuine PDF, and I get the correct text if I select and copy from the file using Preview or Skim.

Thanks in advance for any hints
Thomas