Posted to users@pdfbox.apache.org by Yavuz Nuzumlalı <ma...@gmail.com> on 2011/10/12 14:42:43 UTC

TextPosition returns some characters with wrong case

Hi,

When I use TextPosition to get the text in a PDF file, it sometimes returns
the corresponding character in the wrong case.

For example, the text in the PDF is like this:

"BEBEK RANGE ROVER "

And PDFBox returns the text like this:

"bebek RANGe ROVeR "

I'm using the processTextPosition() method to get the text. What could be the
problem? I can't figure out how to solve it.

Thanks.

Re: TextPosition returns some characters with wrong case

Posted by Kévin Sailly <ke...@gmail.com>.
No, you cannot use the font height. If the embedded font does not match the
standard Unicode mapping (check that with a font editor), you can try to
build a matrix that goes from the font's "special" Unicode mapping to the
standard one.

That's just an idea; I am just a user...

regards,
kévin
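
A minimal Java sketch of the mapping idea in the reply above, assuming the
only wrong characters are the ones visible in the sample from this thread
(b, e and k reported where B, E and K were drawn). GlyphRemapper and its
methods are hypothetical names for illustration, not part of PDFBox; the
table would have to be filled in per font after inspecting it in a font
editor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remaps characters that a font's Unicode mapping reports wrongly. */
class GlyphRemapper {
    private final Map<Character, Character> corrections = new HashMap<>();

    /** Registers one correction: the character extraction reported and the intended one. */
    GlyphRemapper map(char reported, char intended) {
        corrections.put(reported, intended);
        return this;
    }

    /** Applies the table; characters without an entry pass through unchanged. */
    String apply(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char fixed = corrections.getOrDefault(c, c);
            out.append(fixed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: entries inferred only from the sample text in this thread.
        GlyphRemapper remapper = new GlyphRemapper()
                .map('b', 'B').map('e', 'E').map('k', 'K');
        System.out.println(remapper.apply("bebek RANGe ROVeR "));
    }
}
```

Applied blindly, such a table would also uppercase text that really is
lowercase, so it probably only makes sense for text set in the affected font.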

2011/10/13 Yavuz Nuzumlalı <ma...@gmail.com>

> The font used in the PDF file is "Kingfisher-Heavy". Is it one of the
> non-matching fonts?
>
> Can I use character height values to correct this problem?
>
> For example, if I can get the height of each character in the PDF file, I
> can compare it with the heights of nearby characters and then convert a
> lowercase character to uppercase using some logic. Does PDFBox provide an
> interface to get height values for TextPosition objects or characters?

Re: TextPosition returns some characters with wrong case

Posted by Yavuz Nuzumlalı <ma...@gmail.com>.
The font used in the PDF file is "Kingfisher-Heavy". Is it one of the
non-matching fonts?

Can I use character height values to correct this problem?

For example, if I can get the height of each character in the PDF file, I
can compare it with the heights of nearby characters and then convert a
lowercase character to uppercase using some logic. Does PDFBox provide an
interface to get height values for TextPosition objects or characters?
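
The height comparison described above could be sketched roughly as follows.
Glyph and HeightCaseFixer are hypothetical names, and the heights are passed
in as plain numbers; how reliably PDFBox reports a per-character height for
this font (for example through TextPosition.getHeight()) is an assumption
that would need checking.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one extracted character and its rendered height. */
class Glyph {
    final char ch;
    final float height;

    Glyph(char ch, float height) {
        this.ch = ch;
        this.height = height;
    }
}

class HeightCaseFixer {
    /**
     * Uppercases lowercase letters whose height is within `tolerance` of the
     * tallest uppercase letter found on the same line.
     */
    static String fix(List<Glyph> line, float tolerance) {
        // Reference height: the tallest character already reported as uppercase.
        float capHeight = 0f;
        for (Glyph g : line) {
            if (Character.isUpperCase(g.ch)) {
                capHeight = Math.max(capHeight, g.height);
            }
        }
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            boolean looksLikeCapital = capHeight > 0f
                    && Character.isLowerCase(g.ch)
                    && Math.abs(g.height - capHeight) <= tolerance;
            out.append(looksLikeCapital ? Character.toUpperCase(g.ch) : g.ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed heights: every letter is drawn at capital height 7, spaces at 0.
        List<Glyph> line = new ArrayList<>();
        for (char c : "bebek RANGe ROVeR ".toCharArray()) {
            line.add(new Glyph(c, c == ' ' ? 0f : 7f));
        }
        System.out.println(fix(line, 0.5f));
    }
}
```

A line with no uppercase letter at all is left unchanged, because there is no
reference height to compare against. Genuine lowercase letters with ascenders
(b, d, k, l) can be about as tall as capitals, so this heuristic may
uppercase them by mistake.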

On Wed, Oct 12, 2011 at 8:29 PM, Kévin Sailly <ke...@gmail.com> wrote:

> Hello,
>
> It may be a font problem. Does the font embedded in the PDF file match the
> standard font-to-Unicode mapping?
>
> Regards,
> Kévin

Re: TextPosition returns some characters with wrong case

Posted by Kévin Sailly <ke...@gmail.com>.
Hello,

It may be a font problem. Does the font embedded in the PDF file match the
standard font-to-Unicode mapping?

Regards,
Kévin

2011/10/12 Yavuz Nuzumlalı <ma...@gmail.com>

> Hi,
>
> When I use TextPosition to get the text in a PDF file, it sometimes returns
> the corresponding character in the wrong case.
>
> For example, the text in the PDF is like this:
>
> "BEBEK RANGE ROVER "
>
> And PDFBox returns the text like this:
>
> "bebek RANGe ROVeR "
>
> I'm using the processTextPosition() method to get the text. What could be the
> problem? I can't figure out how to solve it.
>
> Thanks.
>