Posted to users@pdfbox.apache.org by Omar Chiyean <om...@gmail.com> on 2009/10/22 03:54:07 UTC
Help with Charset
Hi there...
I'm new to PDFBox. I'm extracting text from some PDFs and storing it
in a String variable. My problem is that Latin characters such as accented
letters do not come out as they should.
How can I set the charset, or how can I see which charset the
TextStripper from PDFBox returns?
I read it was UTF-16BE, but when I get the bytes with this charset and
convert them to ISO-8859-1 I get letters separated by spaces and no luck
with accented letters...
So what's wrong, or can you help me correct this? I'm using PDFBox 0.7.3.
Thanks in advance...
Re: A question about PDFText2HTML
Posted by Shen Wang <fe...@gmail.com>.
Hi Patrick and Chiyean,
Thanks for your replies. They definitely help. I didn't get back to
you earlier because I had no internet connection for the past few
days.
Chiyean: I have checked the ExtractText.java example. Actually, I did
that before I asked the question. It's just that the PDDocument
parameter seems to be used only by the writeText method. The
PDFTextStripper object may still have no idea which document it is
processing when other methods are called.
Patrick: Thanks for reminding me to trace back through the extended
classes. But I still have some problems. For example, without the
ExtractText.java example I would never have figured out what the
"encoding" parameter is and what its options are. The javadoc only
mentions that it is a String. Another example is the
processStream method: one of its parameters is a COSStream, but I have
no idea what that is. It extends COSDictionary, which is a class that
"represents a dictionary where name/value pairs reside". But the javadoc
never explains how a COSStream or a dictionary relates to a PDF
file, and among all the methods of COSStream and COSDictionary I don't see
any that lets these objects know which PDF file is being processed. My
feeling is that I am missing something, but I don't know what it is.
This leaves me confused about what is going on. How do you figure
out how those things (like COSStream, COSDictionary, encoding,
PDResources, keys...) correspond to the PDF file? Do I need to go
through the PDF file format documentation to understand that?
Please help me out. Thanks.
Best,
Felix
Re: A question about PDFText2HTML
Posted by Omar Chiyean <om...@gmail.com>.
Hi Patrick, have you seen the examples
in the distribution?
Check org.apache.pdfbox.ExtractText.java.
It shows how to use this class.
What I can say is that you need a PDDocument Handler.
Check the example; it will be very helpful.
Cheers...
Re: A question about PDFText2HTML
Posted by Patrick Simon <pa...@heypatty.com>.
I think you need to look up the inheritance chain (of PDFText2HTML) at one
of the parent classes.
org.apache.pdfbox.util.PDFStreamEngine has a method with the signature
"processStream(PDPage aPage, PDResources resources, COSStream cosStream)",
which may be what you want:
http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFStreamEngine.html#processStream%28org.apache.pdfbox.pdmodel.PDPage,%20org.apache.pdfbox.pdmodel.PDResources,%20org.apache.pdfbox.cos.COSStream%29
On Fri, Oct 23, 2009 at 1:14 PM, Shen Wang <fe...@gmail.com> wrote:
> Hey guys,
>
> I know this question may be silly, but I have worked on this for two days
> and am really frustrated (I just jump back and forth in the javadoc and
> get lost again and again). Does anybody know how to use the class
> PDFText2HTML? The javadoc for this class is
> http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFText2HTML.html.
> How can I let the class know which PDF I am looking at? Apart from the method
> endDocument(), I don't see any way to pass information about
> the file to this class. I know I must be missing something; could you please
> help me out? Thanks.
>
> Best,
>
> Felix
>
>
A question about PDFText2HTML
Posted by Shen Wang <fe...@gmail.com>.
Hey guys,
I know this question may be silly, but I have worked on this for two
days and am really frustrated (I just jump back and forth in the javadoc
and get lost again and again). Does anybody know how to use the
class PDFText2HTML? The javadoc for this class is
http://incubator.apache.org/pdfbox/javadoc/org/apache/pdfbox/util/PDFText2HTML.html.
How can I let the class know which PDF I am looking at? Apart from the
method endDocument(), I don't see any way to pass
information about the file to this class. I know I must be missing something;
could you please help me out? Thanks.
Best,
Felix
Re: Help with Charset
Posted by Omar Chiyean <om...@gmail.com>.
Hi Andreas...
I've set up the logger with a handler, but I am still getting
these messages in the console...
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: cs
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: CS
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: sc
23/10/2009 09:51:58 AM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: SC
Is there a way to get rid of them?
This is what I did...
Handler fh = new FileHandler("pdfbox.log");
Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine").addHandler(fh);
The log output now goes to the file, but it keeps appearing in the console too.
What's wrong?
Cheers...
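A likely explanation, assuming the default java.util.logging setup: adding a handler to a named logger does not stop the console output, because every record is also passed up to the root logger, whose ConsoleHandler prints it. The sketch below shows two ways around that. The class and method names (QuietPdfBoxLogging, redirect, silenceInfo) are made up for illustration; only the logger name is taken from the messages above.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger that nobody references may be garbage collected.
    static final Logger ENGINE_LOG =
            Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");

    /** Sends the engine's records to the given handler only, not to the console. */
    static void redirect(Handler handler) {
        ENGINE_LOG.addHandler(handler);
        // Without this call, each record is also handed to the parent (root)
        // logger, and its default ConsoleHandler prints it.
        ENGINE_LOG.setUseParentHandlers(false);
    }

    /** Alternative: drop INFO records from this logger altogether. */
    static void silenceInfo() {
        ENGINE_LOG.setLevel(Level.WARNING);
    }

    public static void main(String[] args) throws IOException {
        redirect(new FileHandler("pdfbox.log"));
        // Goes to pdfbox.log only, no longer to the console.
        ENGINE_LOG.info("unsupported/disabled operation: cs");
    }
}
```

Calling redirect keeps the messages in the file; calling silenceInfo instead discards them before any handler sees them.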
2009/10/23 Andreas Lehmkühler <an...@lehmi.de>
> Hi
>
> Omar Chiyean schrieb:
> > Thanks Andreas...
> > It really worked. I've updated to 0.8.0 and checked that ExtractText, using
> > the same syntax, works OK.
> >
> > Now I get this logging output. What is it?
> >
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: cs
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: CS
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: sc
> > 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> > processOperator
> > INFO: unsupported/disabled operation: SC
> >
> >
> > Can I disable it? How do I disable logging?
> Yes. Have a look at [1]. The mentioned logging.properties file is part
> of the source distribution of pdfbox.
>
>
> [1] http://markmail.org/message/3wpukybujqsbfna5
>
> >
> > Thanks in advance...
> >
>
> BR
> Andreas Lehmkühler
>
Re: Help with Charset
Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi
Omar Chiyean schrieb:
> Thanks Andreas...
> It really worked. I've updated to 0.8.0 and checked that ExtractText, using
> the same syntax, works OK.
>
> Now I get this logging output. What is it?
>
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: cs
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: CS
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: sc
> 22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: SC
>
>
> Can I disable it? How do I disable logging?
Yes. Have a look at [1]. The mentioned logging.properties file is part
of the source distribution of pdfbox.
[1] http://markmail.org/message/3wpukybujqsbfna5
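The contents of that logging.properties file are not shown in this thread, so the following is only a generic java.util.logging sketch of the same idea: raise the level of the PDFStreamEngine logger so that its INFO records are dropped. The settings string and the class name LoggingConfigSketch are assumptions for illustration, not PDFBox's own file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

class LoggingConfigSketch {
    // Illustrative settings only; NOT the contents of PDFBox's own file.
    static final String CONFIG =
            "handlers = java.util.logging.ConsoleHandler\n"
          + ".level = INFO\n"
          + "org.apache.pdfbox.util.PDFStreamEngine.level = WARNING\n";

    /** Applies CONFIG and returns the engine logger (keep the reference). */
    static Logger configure() throws IOException {
        // Properties streams are read as ISO-8859-1.
        LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.ISO_8859_1)));
        return Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
    }

    public static void main(String[] args) throws IOException {
        Logger engine = configure();
        System.out.println("INFO enabled:    " + engine.isLoggable(Level.INFO));
        System.out.println("WARNING enabled: " + engine.isLoggable(Level.WARNING));
    }
}
```

The same three settings could instead live in a file that is passed to the JVM with -Djava.util.logging.config.file=<path>, which avoids any code change.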
>
> Thanks in advance...
>
BR
Andreas Lehmkühler
Question about extracting font information
Posted by Shen Wang <fe...@gmail.com>.
Hey guys,
I am trying to extract the font information of the text in a PDF file.
More concretely, I want to find every sentence on a given page that has the
smallest font size and a bold font, and then output both
that sentence and the next one. I know it sounds a little weird...
By the way, I am new to PDFBox and I am struggling to get it to work. Could
you share some experience on how to get started? Do you just
browse the javadoc? Is there a better way to learn the
logic underneath all the PDFBox classes? I have read through the example
code, but I am still very confused...
Thanks for any suggestions!
Best,
Felix
Re: Help with Charset
Posted by Omar Chiyean <om...@gmail.com>.
Thanks Andreas...
It really worked. I've updated to 0.8.0 and checked that ExtractText, using
the same syntax, works OK.
Now I get this logging output. What is it?
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: cs
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: CS
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: sc
22/10/2009 08:48:06 PM org.apache.pdfbox.util.PDFStreamEngine
processOperator
INFO: unsupported/disabled operation: SC
Can I disable it? How do I disable logging?
Thanks in advance...
2009/10/22 Andreas Lehmkühler <an...@lehmi.de>
> Hi,
>
> Omar Chiyean schrieb:
> > Hi there...
> > I'm new to PDFBox. I'm extracting text from some PDFs and storing it
> > in a String variable. My problem is that Latin characters such as accented
> > letters do not come out as they should.
> >
> > How can I set the charset, or how can I see which charset the
> > TextStripper from PDFBox returns?
> >
> > I read it was UTF-16BE, but when I get the bytes with this charset and
> > convert them to ISO-8859-1 I get letters separated by spaces and no luck
> > with accented letters...
> >
> > So what's wrong, or can you help me correct this? I'm using PDFBox 0.7.3.
> First of all, I suggest updating to PDFBox 0.8. It includes a lot of
> improvements and bug fixes. Back to your question: you are able to
> choose the needed charset before extraction. Have a look at ExtractText
> as an example of how to use the text extraction.
>
> BR
> Andreas Lehmkühler
>
Re: Help with Charset
Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,
Omar Chiyean schrieb:
> Hi there...
> I'm new to PDFBox. I'm extracting text from some PDFs and storing it
> in a String variable. My problem is that Latin characters such as accented
> letters do not come out as they should.
>
> How can I set the charset, or how can I see which charset the
> TextStripper from PDFBox returns?
>
> I read it was UTF-16BE, but when I get the bytes with this charset and
> convert them to ISO-8859-1 I get letters separated by spaces and no luck
> with accented letters...
>
> So what's wrong, or can you help me correct this? I'm using PDFBox 0.7.3.
First of all, I suggest updating to PDFBox 0.8. It includes a lot of
improvements and bug fixes. Back to your question: you are able to
choose the needed charset before extraction. Have a look at ExtractText
as an example of how to use the text extraction.
BR
Andreas Lehmkühler
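
Some background that may explain the "letters separated by spaces" symptom: a Java String carries no charset of its own; a charset only applies when text is turned into bytes or read back from bytes. Below is a minimal sketch in plain Java (no PDFBox calls; the class name CharsetSketch and the sample word are made up) of what happens when UTF-16BE bytes are decoded as ISO-8859-1, next to encoding the String directly. It is not meant to explain the accent problem in 0.7.3 itself.

```java
import java.nio.charset.StandardCharsets;

class CharsetSketch {
    /** Encodes text as UTF-16BE, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    /** Encodes text directly in the target charset. */
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by the stripper; \u00f3 is an accented o.
        String extracted = "canci\u00f3n";

        String wrong = misdecode(extracted);
        // Two bytes per letter become two characters per letter.
        System.out.println(wrong.length() == 2 * extracted.length());
        // The extra character is NUL (code 0), often displayed as a blank.
        System.out.println((int) wrong.charAt(0));

        // Direct encoding yields one byte per character for Latin-1 text.
        System.out.println(toLatin1(extracted).length == extracted.length());
    }
}
```

When writing extracted text to a file, the same choice is usually made once, by wrapping the output stream in an OutputStreamWriter created with the desired charset name.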