Posted to users@pdfbox.apache.org by Omar Chiyean <om...@gmail.com> on 2009/10/22 03:54:07 UTC

Help with Charset

Hi there...
I'm new to PDFBox and I'm extracting text from some PDFs and storing it
in a String variable. My problem is that Latin characters such as accented
letters do not come out as they should.

How can I set the charset, or how can I see which charset the
TextStripper from PDFBox returns?

I read it was UTF-16BE, but when I get the bytes with this charset and
convert them to ISO-8859-1, I get letters separated by spaces and no luck
with accented letters...

So what's wrong, and can you help me correct this? I'm using PDFBox 0.7.3.

Thanks in advance...

Re: A question about PDFText2HTML

Posted by Shen Wang <fe...@gmail.com>.

Hi Patrick and Chiyean,

Thanks for your replies. They definitely help. I didn't get back to
you earlier because I had no internet connection for the past few
days.

Chiyean: I have checked the ExtractText.java example. Actually, I did
that before I asked the question. It's just that the PDDocument
parameter seems to be used only by the writeText method. The
PDFTextStripper object may still have no idea which document it's
processing when other methods are called.

Patrick: Thanks for reminding me to trace back through the extended
classes. But I still have some problems. For example, without the
ExtractText.java example, I would never have figured out what the
"encoding" parameter is and what its options are. The javadoc only
mentions that it's a String. Another example is the processStream
method: one of its parameters is a COSStream, but I have no idea what
that is. It extends COSDictionary, a class that "represents a
dictionary where name/value pairs reside". But the javadoc never
mentions how a COSStream or a dictionary relates to a PDF file, and
among all the methods of COSStream and COSDictionary I don't see any
that lets these objects know which PDF file is being processed. My
feeling is that I must be missing something, but I don't know what it
is, and that leaves me confused about what is going on. How do you
figure out how those things (like COSStream, COSDictionary, encoding,
PDResources, keys...) correspond to the PDF file? Do I need to go
through the PDF file documentation to make this clear to myself?
Please help me out. Thanks.

Best,

Felix

Omar Chiyean wrote:
> Hi Patrick, have you seen the examples
> in the distribution?
>
> Check org.apache.pdfbox.ExtractText.java.
> It shows how to use this class.
>
> What I can say is that you need a PDDocument handle.
> Check the example; it will be very helpful.
>
> Cheers...
>
>

Re: A question about PDFText2HTML

Posted by Omar Chiyean <om...@gmail.com>.

Hi Patrick, have you seen the examples
in the distribution?

Check org.apache.pdfbox.ExtractText.java.
It shows how to use this class.

What I can say is that you need a PDDocument handle.
Check the example; it will be very helpful.

Cheers...

Re: A question about PDFText2HTML

Posted by Patrick Simon <pa...@heypatty.com>.

I think you need to look up the inheritance chain (of PDFText2HTML) at one
of the parent classes.

org.apache.pdfbox.util.PDFStreamEngine has a method with the signature
"processStream(PDPage aPage, PDResources resources, COSStream cosStream)",
which may be what you want.

Javadoc links:
processStream: <http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFStreamEngine.html#processStream%28org.apache.pdfbox.pdmodel.PDPage,%20org.apache.pdfbox.pdmodel.PDResources,%20org.apache.pdfbox.cos.COSStream%29>
PDPage: <http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/pdmodel/PDPage.html>
PDResources: <http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/pdmodel/PDResources.html>
COSStream: <http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/cos/COSStream.html>
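
A note on why the document does not show up in every method signature: as far as I understand it, classes like PDFTextStripper and PDFText2HTML follow a template-method design. The caller passes the document once to an entry method (getText or writeText), and that method then calls hooks such as startDocument and endDocument, which subclasses override. The sketch below only illustrates that pattern. ToyDocument, ToyStripper, ToyHtmlStripper and writePage are hypothetical names invented for this example, not PDFBox classes, and the real implementation may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

class TemplateMethodSketch {

    // Hypothetical stand-in for a parsed document; not a PDFBox class.
    static class ToyDocument {
        final List<String> pages;

        ToyDocument(String... pages) {
            this.pages = Arrays.asList(pages);
        }
    }

    // Hypothetical base class playing the role of a plain-text stripper.
    static class ToyStripper {
        protected StringBuilder out;

        // The only place where the caller hands over the document.
        // Everything else is driven from here through overridable hooks.
        public String getText(ToyDocument doc) {
            out = new StringBuilder();
            startDocument(doc);
            for (String page : doc.pages) {
                writePage(page);
            }
            endDocument(doc);
            return out.toString();
        }

        protected void startDocument(ToyDocument doc) { }

        protected void endDocument(ToyDocument doc) { }

        protected void writePage(String text) {
            out.append(text).append('\n');
        }
    }

    // Hypothetical subclass: it never receives the document "by itself",
    // it only customises the hooks that getText calls.
    static class ToyHtmlStripper extends ToyStripper {
        @Override
        protected void startDocument(ToyDocument doc) {
            out.append("<html><body>\n");
        }

        @Override
        protected void writePage(String text) {
            out.append("<p>").append(text).append("</p>\n");
        }

        @Override
        protected void endDocument(ToyDocument doc) {
            out.append("</body></html>");
        }
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument("Hello", "World");
        System.out.println(new ToyHtmlStripper().getText(doc));
    }
}
```

Read this way, a subclass that seems to expose only endDocument() is still used through the entry method it inherits from its parent, which is where the document is supplied.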

On Fri, Oct 23, 2009 at 1:14 PM, Shen Wang <fe...@gmail.com> wrote:

> Hey guys,
>
> I know this question may be silly, but I have worked on this for two days
> and got really frustrated (I just jump back and forth in the javadoc and
> get lost again and again). Does anybody know how to use the PDFText2HTML
> class? The javadoc for this class is
> http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFText2HTML.html.
> How can I let the class know which PDF I am looking at? Apart from the
> method endDocument(), I don't see any other way to pass information about
> the file to this class. I know I must be missing something; could you
> please help me out? Thanks.
>
> Best,
>
> Felix
>
>

A question about PDFText2HTML

Posted by Shen Wang <fe...@gmail.com>.

Hey guys,

I know this question may be silly, but I have worked on this for two
days and got really frustrated (I just jump back and forth in the javadoc
and get lost again and again). Does anybody know how to use the
PDFText2HTML class? The javadoc for this class is
http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFText2HTML.html.
How can I let the class know which PDF I am looking at? Apart from the
method endDocument(), I don't see any other way to pass information
about the file to this class. I know I must be missing something;
could you please help me out? Thanks.

Best,

Felix

Re: Help with Charset

Posted by Omar Chiyean <om...@gmail.com>.

Hi Andreas...

I've set the logger properties with a logger handler, but I'm still getting
these messages in the console...

23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: cs
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: CS
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: sc
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: SC

Is there a way to fix this?

This is what I did...

Handler fh = new FileHandler("pdfbox.log");
Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine").addHandler(fh);

I get the log output in the file, but it keeps appearing in the console.
What's wrong?

Cheers...
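
One likely explanation, assuming the messages go through java.util.logging (which the Handler and Logger calls in this post suggest): adding a handler to a logger does not stop records from also being passed to the parent loggers, and the root logger normally has a ConsoleHandler attached. A minimal sketch of switching that off is shown below. ListHandler and redirect are hypothetical helper names for this example; ListHandler only exists so the sketch runs without writing a file, and a FileHandler as in the post could be passed to redirect instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class QuietEngineLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly, so
    // settings on an otherwise unreferenced logger may be lost after GC.
    static final Logger ENGINE_LOGGER =
            Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");

    // Hypothetical in-memory handler, used here in place of a FileHandler.
    static final class ListHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getMessage());
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    static void redirect(Logger logger, Handler target) {
        logger.addHandler(target);
        // Without this call each record is also handed to the parent (root)
        // logger, whose default ConsoleHandler prints it to the console.
        logger.setUseParentHandlers(false);
    }

    public static void main(String[] args) {
        ListHandler target = new ListHandler();
        redirect(ENGINE_LOGGER, target);
        ENGINE_LOGGER.info("unsupported/disabled operation: cs");
        System.out.println(target.messages);
    }
}
```

With setUseParentHandlers(false) the record should reach only the handler added here. Raising the logger's level to WARNING would be the alternative if the INFO messages are not wanted at all.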

2009/10/23 Andreas Lehmkühler <an...@lehmi.de>

> Hi
>
> Omar Chiyean wrote:
> > Thanks Andreas...
> > It really worked. I've updated to 0.8.0 and checked ExtractText; using
> > the same syntax, it is working OK.
> >
> > Now I get this logging output. What is it?
> >
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: cs
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: CS
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: sc
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: SC
> >
> >
> > Can I disable it, or how do I disable logging?
> Yes. Have a look at [1]. The logging.properties file mentioned there is
> part of the source distribution of PDFBox.
>
>
> [1] http://markmail.org/message/3wpukybujqsbfna5
>
> >
> > Thanks in advance...
> >
> > 2009/10/22 Andreas Lehmkühler <an...@lehmi.de>
> >
> >> Hi,
> >>
> >> Omar Chiyean wrote:
> >>> Hi there...
> >>> I'm new to PDFBox and I'm extracting text from some PDFs and storing
> >>> it in a String variable. My problem is that Latin characters such as
> >>> accented letters do not come out as they should.
> >>>
> >>> How can I set the charset, or how can I see which charset the
> >>> TextStripper from PDFBox returns?
> >>>
> >>> I read it was UTF-16BE, but when I get the bytes with this charset and
> >>> convert them to ISO-8859-1, I get letters separated by spaces and no
> >>> luck with accented letters...
> >>>
> >>> So what's wrong, and can you help me correct this? I'm using PDFBox
> >>> 0.7.3.
> >> First of all, I suggest updating to PDFBox 0.8. It includes a lot of
> >> improvements and bug fixes. Back to your question: you are able to
> >> choose the needed charset before extraction. Have a look at ExtractText
> >> as an example of how to use the text extraction.
> >>
> >> BR
> >> Andreas Lehmkühler
>
> BR
> Andreas Lehmkühler
>

Re: Help with Charset

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi

Omar Chiyean wrote:
> Thanks Andreas...
> It really worked. I've updated to 0.8.0 and checked ExtractText; using
> the same syntax, it is working OK.
>
> Now I get this logging output. What is it?
> 
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: cs
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: CS
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: sc
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: SC
> 
> 
> Can I disable it, or how do I disable logging?
Yes. Have a look at [1]. The logging.properties file mentioned there is
part of the source distribution of PDFBox.


[1] http://markmail.org/message/3wpukybujqsbfna5
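
The linked message and the logging.properties file from the source distribution are not reproduced here. As a rough sketch of the general mechanism only, assuming the output comes from java.util.logging, a configuration can raise the level of the PDFStreamEngine logger so that INFO records are dropped. The three property lines in CONFIG are an illustrative guess, not the contents of the distributed file. In a real setup they would normally live in a file named through the java.util.logging.config.file system property instead of being loaded from a string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

class LoggingConfigSketch {

    // Illustrative configuration only; the real logging.properties shipped
    // with the source distribution may look different.
    static final String CONFIG =
            "handlers = java.util.logging.ConsoleHandler\n"
          + ".level = INFO\n"
          + "org.apache.pdfbox.util.PDFStreamEngine.level = WARNING\n";

    // Loads the configuration and returns the affected logger. The caller
    // should keep the returned reference, since loggers are held weakly.
    static Logger configure() throws IOException {
        LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.ISO_8859_1)));
        return Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
    }

    public static void main(String[] args) throws IOException {
        Logger engine = configure();
        // INFO records such as "unsupported/disabled operation" are filtered
        // by the logger's level; warnings and errors still get through.
        System.out.println("INFO enabled: " + engine.isLoggable(Level.INFO));
        System.out.println("WARNING enabled: " + engine.isLoggable(Level.WARNING));
    }
}
```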

> 
> Thanks in advance...
> 
> 2009/10/22 Andreas Lehmkühler <an...@lehmi.de>
> 
>> Hi,
>>
>> Omar Chiyean wrote:
>>> Hi there...
>>> I'm new to PDFBox and I'm extracting text from some PDFs and storing
>>> it in a String variable. My problem is that Latin characters such as
>>> accented letters do not come out as they should.
>>>
>>> How can I set the charset, or how can I see which charset the
>>> TextStripper from PDFBox returns?
>>>
>>> I read it was UTF-16BE, but when I get the bytes with this charset and
>>> convert them to ISO-8859-1, I get letters separated by spaces and no
>>> luck with accented letters...
>>>
>>> So what's wrong, and can you help me correct this? I'm using PDFBox
>>> 0.7.3.
>> First of all, I suggest updating to PDFBox 0.8. It includes a lot of
>> improvements and bug fixes. Back to your question: you are able to
>> choose the needed charset before extraction. Have a look at ExtractText
>> as an example of how to use the text extraction.
>>
>> BR
>> Andreas Lehmkühler

BR
Andreas Lehmkühler

Question about extracting font information

Posted by Shen Wang <fe...@gmail.com>.

Hey guys,

I am trying to extract the font information of the text in a PDF file.
More concretely, I want to find all the sentences that have the
smallest font size and a bold font on a given page, and then output
each such sentence together with the sentence after it. I know it
sounds a little weird...

By the way, I am new to PDFBox and I am struggling to get it to work.
Could you share some experience of how to get started? Do you just
browse through the javadoc? Is there a better way to learn the
logic underneath all the PDFBox classes? I have read through the example
code, but I am still very confused...

Thanks for any suggestions!

Best,

Felix
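
The selection step of this question can be sketched independently of PDFBox. The example below assumes the text has already been split into sentences, each tagged with a font size and a bold flag. Run and smallestBoldWithNext are hypothetical names for this sketch. How to obtain those values from PDFBox (for instance from the text positions seen during extraction, with bold guessed from the font name) is left out and would need to be checked against the javadoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SmallestBoldSketch {

    // Hypothetical value type: one sentence with its font attributes.
    static final class Run {
        final String text;
        final float fontSize;
        final boolean bold;

        Run(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    // Returns every bold sentence that has the smallest font size among the
    // bold sentences, each followed by the sentence after it (if any).
    static List<String> smallestBoldWithNext(List<Run> runs) {
        float smallest = Float.MAX_VALUE;
        for (Run run : runs) {
            if (run.bold && run.fontSize < smallest) {
                smallest = run.fontSize;
            }
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runs.size(); i++) {
            Run run = runs.get(i);
            if (run.bold && run.fontSize == smallest) {
                result.add(run.text);
                if (i + 1 < runs.size()) {
                    result.add(runs.get(i + 1).text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Run> page = Arrays.asList(
                new Run("Big heading.", 18f, true),
                new Run("Body text.", 10f, false),
                new Run("Small note.", 8f, true),
                new Run("Follow-up.", 10f, false),
                new Run("Tiny plain.", 6f, false));
        System.out.println(smallestBoldWithNext(page));
    }
}
```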

Re: Help with Charset

Posted by Omar Chiyean <om...@gmail.com>.

Thanks Andreas...
It really worked. I've updated to 0.8.0 and checked ExtractText; using
the same syntax, it is working OK.

Now I get this logging output. What is it?

22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: cs
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: CS
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: sc
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: SC

Can I disable it, or how do I disable logging?

Thanks in advance...

2009/10/22 Andreas Lehmkühler <an...@lehmi.de>

> Hi,
>
> Omar Chiyean wrote:
> > Hi there...
> > I'm new to PDFBox and I'm extracting text from some PDFs and storing
> > it in a String variable. My problem is that Latin characters such as
> > accented letters do not come out as they should.
> >
> > How can I set the charset, or how can I see which charset the
> > TextStripper from PDFBox returns?
> >
> > I read it was UTF-16BE, but when I get the bytes with this charset and
> > convert them to ISO-8859-1, I get letters separated by spaces and no
> > luck with accented letters...
> >
> > So what's wrong, and can you help me correct this? I'm using PDFBox
> > 0.7.3.
> First of all, I suggest updating to PDFBox 0.8. It includes a lot of
> improvements and bug fixes. Back to your question: you are able to
> choose the needed charset before extraction. Have a look at ExtractText
> as an example of how to use the text extraction.
>
> BR
> Andreas Lehmkühler
>

Re: Help with Charset

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

Omar Chiyean wrote:
> Hi there...
> I'm new to PDFBox and I'm extracting text from some PDFs and storing it
> in a String variable. My problem is that Latin characters such as accented
> letters do not come out as they should.
>
> How can I set the charset, or how can I see which charset the
> TextStripper from PDFBox returns?
>
> I read it was UTF-16BE, but when I get the bytes with this charset and
> convert them to ISO-8859-1, I get letters separated by spaces and no luck
> with accented letters...
>
> So what's wrong, and can you help me correct this? I'm using PDFBox 0.7.3.
First of all, I suggest updating to PDFBox 0.8. It includes a lot of
improvements and bug fixes. Back to your question: you are able to
choose the needed charset before extraction. Have a look at ExtractText
as an example of how to use the text extraction.

BR
Andreas Lehmkühler
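
For readers hitting the same symptom, here is a small sketch in plain Java (no PDFBox calls) of what probably went wrong and how an explicit charset is usually applied. A Java String is already Unicode, so the charset only matters when the text is turned into bytes, for example by the Writer that the extracted text is written to. Decoding UTF-16BE bytes as ISO-8859-1 puts a NUL character before every letter, which many viewers show as a gap. CharsetSketch, encode and misdecode are hypothetical names, and the sample word stands in for text returned by the stripper.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {

    // Encodes text with an explicit charset by writing it through a Writer,
    // the same way extracted text would be written to a file.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        }
        return bytes.toByteArray();
    }

    // Reproduces the mistake from the question: bytes produced as UTF-16BE
    // are interpreted as if they were ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String extracted = "canci\u00f3n"; // stand-in for extracted text

        // Correct: one ISO-8859-1 byte per character, accent included.
        byte[] latin1 = encode(extracted, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);

        // Wrong: every character is preceded by a NUL (U+0000).
        String broken = misdecode(extracted);
        System.out.println("Misdecoded length: " + broken.length());
    }
}
```

So instead of converting the bytes afterwards, the String can be kept as it is and the target charset passed to whatever writes it out.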