Posted to users@pdfbox.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2019/01/31 01:56:52 UTC

Choosing a font for non-ASCII characters

Hello,

We are using PDFBox to generate PDFs in a very simple way and only
including fonts available from the PDType1Font class (e.g.
PDType1Font.HELVETICA). The PDFs we are generating are really only
including a few title/subtitles, text, and bulleted/numbered lists.

Everything is fine as long as we stay within what is probably the
standard Latin alphabet, but we've had some trouble with special
characters that don't fit in there, such as ≥ and ≤. We've dealt with
that by simply replacing "≤" with "<=" and so on, but we're starting to
use languages that don't use Latin script, so we can no longer replace
our way out of the problem.

For example, I need to be able to put Chinese characters into a PDF we
generate. So let's take the text "中國" which is just the word "China"
in Traditional Chinese script.

First, how can I find out that the character isn't going to fit into
the font that I'm currently using? Should I do it for every character
we try to put into the page, or should we just catch exceptions when
we try to write the text to the page and then scan at that point? I'm
trying to avoid writing hideously inefficient code to handle these
situations.

Second, once I know that I need to choose another font... how do I
know which font to choose? Should I keep a mapping of Unicode code
point ranges and the best fonts to use for them?

Finally, what fonts are actually available to PDFBox? How do I add new
ones? I have a lot of control over the environment and I get to see
failing conversions and intervene, so some trial and error is okay for
each new situation.

The recipients of our PDFs are file-size sensitive, so I'd only want
to include (bundle) a font in a PDF if it was absolutely necessary to
include the font itself. If we can get away with including a
*reference* to the font in the PDF and telling these recipients
"sorry, if you want to read the Chinese PDFs we send, you'd better
make sure you have font X installed" then that's okay with me, too.

What suggestions do people have for doing all of the above?

Thanks,
-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Choosing a font for non-ASCII characters

Posted by Malcolm Vincent <ma...@gmail.com>.

I found Arial Unicode to be a great help here

On Thu, 31 Jan 2019 at 06:34, Matthew Self <ma...@mself.com> wrote:
>
> Google's Noto fonts support a huge number of Unicode characters:
> https://www.google.com/get/noto/
>
> You would need to install the ones you need.  I believe you will also need
> to figure out which fonts to use for which characters on your own --
> possibly by knowing the language/script you are targeting.

Re: Choosing a font for non-ASCII characters

Posted by Matthew Self <ma...@mself.com>.

Google's Noto fonts support a huge number of Unicode characters:
https://www.google.com/get/noto/

You would need to install the ones you need.  I believe you will also need
to figure out which fonts to use for which characters on your own --
possibly by knowing the language/script you are targeting.
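
One way to sketch the "which font for which script" idea is a lookup keyed
on the JDK's Character.UnicodeScript. This is only an illustration under
stated assumptions: the class name ScriptFontChooser and the font file
names are hypothetical placeholders, and nothing here calls PDFBox.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

/** Picks a font file name from the Unicode script of a code point. */
final class ScriptFontChooser {
    // Hypothetical file names; substitute whatever fonts are really installed.
    private static final Map<UnicodeScript, String> FONT_BY_SCRIPT =
            new EnumMap<>(UnicodeScript.class);

    static {
        FONT_BY_SCRIPT.put(UnicodeScript.HAN, "NotoSansTC-Regular.ttf");
        FONT_BY_SCRIPT.put(UnicodeScript.CYRILLIC, "NotoSans-Regular.ttf");
        FONT_BY_SCRIPT.put(UnicodeScript.ARABIC, "NotoSansArabic-Regular.ttf");
    }

    /** Returns the mapped font for the code point's script, else defaultFont. */
    static String fontFor(int codePoint, String defaultFont) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return FONT_BY_SCRIPT.getOrDefault(script, defaultFont);
    }
}
```

Calling ScriptFontChooser.fontFor("中國".codePointAt(0), "Helvetica") would
select the entry registered for HAN. Note that the script alone does not
distinguish Traditional from Simplified Chinese, Japanese, or Korean use of
Han characters, so knowing the target language is probably still needed.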


Re: Choosing a font for non-ASCII characters

Posted by Marc Kaufman <ma...@eeph.com>.

If you look at a Type 1 font with CharacterMap, you will see that it has 
many more than 255 characters. These characters can be accessed by using 
a Differences entry in the Font Dictionary to remap one of the 255 to 
the desired character. Characters like 'lessequal' can be reached in 
this way. See the PDF Reference, section 9.

Chinese (Asian fonts in general) requires Type 0 fonts.

Marc


Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 03.03.2019 um 14:46 schrieb Christopher Schultz:
> Tilman,
>
> On 3/2/19 10:00, Tilman Hausherr wrote:
>> Am 02.03.2019 um 15:54 schrieb Christopher Schultz:
>>> Is there a good way to probe text to determine whether or not an
>>> alternate font will be necessary and only load/bundle it then?
>> From the new EmbeddedMultipleFonts.java example (in the source
>> code download):
>>
>> boolean isWinAnsiEncoding(int unicode) {
>>     String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>>     if (".notdef".equals(name)) {
>>         return false;
>>     }
>>     return WinAnsiEncoding.INSTANCE.contains(name);
>> }
>>
>> When that one returns true, you can use the built-in fonts.
> Okay, I see that. Is there any reason not to do this?
>
>      boolean isWinAnsiEncoding(int unicode)
>      {
>          return WinAnsiEncoding.INSTANCE.contains(unicode);
>      }
>
> ?

I haven't tried that, because those numbers are not Unicode code points.
Some match, some don't. These "codes" are the character codes used in
the PDF.
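
To illustrate why a Unicode value and a single-byte code can differ, here
is a minimal sketch. It assumes the JDK's windows-1252 charset is a close
enough stand-in for the PDF WinAnsiEncoding for demonstration purposes;
WinAnsiCodeDemo and codeFor are hypothetical names and no PDFBox class is
used.

```java
import java.nio.charset.Charset;

/** Shows that a single-byte encoding's codes are not Unicode code points. */
final class WinAnsiCodeDemo {
    // Assumption: windows-1252 approximates WinAnsiEncoding for this demo.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns the single-byte code for a character, or -1 if it has none. */
    static int codeFor(char c) {
        if (!CP1252.newEncoder().canEncode(c)) {
            return -1;
        }
        return String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // 'A' happens to have the same value in both numbering schemes.
        System.out.printf("U+%04X -> code 0x%02X%n", (int) 'A', codeFor('A'));
        // The euro sign does not: its Unicode value is far above 255.
        System.out.printf("U+%04X -> code 0x%02X%n", (int) '\u20AC', codeFor('\u20AC'));
        // The less-than-or-equal sign has no code at all in this encoding.
        System.out.println("U+2264 -> " + codeFor('\u2264'));
    }
}
```

In this stand-in encoding the euro sign U+20AC gets code 0x80, so passing
the Unicode value straight to a lookup that expects a code would test the
wrong number.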

>
> Is there nothing like PDFont.isSupportedCodePoint(unicode) available?


No. I agree that the way it is done now is sort of annoying, but for
some reason it hasn't been improved. Maybe because each of the font
types has a different approach to finding out whether it works... until
then, catching IllegalArgumentException is the way to go.
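
The catch-the-exception approach can be hidden behind a boolean helper, as
in the sketch below. EncodeProbe and its nested TextEncoder interface are
hypothetical stand-ins written for this illustration, not PDFBox types; the
only assumption carried over from the thread is that encoding throws
IllegalArgumentException when a character is unsupported. A real font call
may also declare checked exceptions, which this sketch leaves out.

```java
/** Wraps "try to encode, catch the failure" in a yes/no check. */
final class EncodeProbe {
    /** Hypothetical stand-in for the one font method this sketch needs. */
    interface TextEncoder {
        /** Assumed to throw IllegalArgumentException for unsupported text. */
        byte[] encode(String text);
    }

    /** Returns true if the encoder accepts the whole string. */
    static boolean canEncode(TextEncoder font, String text) {
        try {
            font.encode(text);
            return true;
        } catch (IllegalArgumentException e) {
            // Per the thread, this is the signal for an unsupported character.
            return false;
        }
    }
}
```

The exception is still used for flow control; the helper only keeps that
detail out of the calling code.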

Did you get your application to work, or should PDFBox be redesigned first?


> I didn't see anything. It looks more like the standard way to check is to:
>
> try {
>    page.showText(text);
> } catch (IllegalArgumentException iae) {
>    page.setFont(alternateFont);
>    page.showText(text);
> }

You can also call encode instead, as done in the example.


>
> If that's SOP, then maybe there is no real reason to bother checking
> whether the String will work in the first place... just try it and try
> again if the operation fails?
>
> Catching IllegalArgumentException seems ugly, though. Maybe PDFBox
> could subclass IllegalArgumentException with something more narrow
> like IllegalCodePointException and throw that instead? It would be
> backward-compatible and also one could determine the root cause
> without parsing the exception message to see what the problem was.

IIRC it's always that cause if it throws.

Tilman


>
> I'm happy to provide a patch.
>
> -chris

Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Tilman,

On 3/2/19 10:00, Tilman Hausherr wrote:
> Am 02.03.2019 um 15:54 schrieb Christopher Schultz:
>> Is there a good way to probe text to determine whether or not an
>> alternate font will be necessary and only load/bundle it then?
>
> From the new EmbeddedMultipleFonts.java example (in the source
> code download):
>
> boolean isWinAnsiEncoding(int unicode) {
>     String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>     if (".notdef".equals(name)) {
>         return false;
>     }
>     return WinAnsiEncoding.INSTANCE.contains(name);
> }
>
> When that one returns true, you can use the built-in fonts.

Okay, I see that. Is there any reason not to do this?

    boolean isWinAnsiEncoding(int unicode)
    {
        return WinAnsiEncoding.INSTANCE.contains(unicode);
    }

?

Is there nothing like PDFont.isSupportedCodePoint(unicode) available?
I didn't see anything. It looks more like the standard way to check is to:

try {
  page.showText(text);
} catch (IllegalArgumentException iae) {
  page.setFont(alternateFont);
  page.showText(text);
}

If that's SOP, then maybe there is no real reason to bother checking
whether the String will work in the first place... just try it and try
again if the operation fails?

Catching IllegalArgumentException seems ugly, though. Maybe PDFBox
could subclass IllegalArgumentException with something more narrow
like IllegalCodePointException and throw that instead? It would be
backward-compatible and also one could determine the root cause
without parsing the exception message to see what the problem was.

I'm happy to provide a patch.

-chris

Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Tilman,

On 3/20/19 03:55, Tilman Hausherr wrote:
>> That would just push what you called "Yuck" further downwards, or
>> we would have to maintain code twice, one for checking whether
>> something can be encoded, and one for actually doing it. And this
>> for all the 6, maybe 7 font types.

Code reuse?

>> Instead of going forward with your project with the working code 
>> provided, you're arguing about design issues.

You are operating under the impression that I haven't already modified
my own code to work. I have.

I'm volunteering to help improve your product. You don't have to get
so upset when someone offers help.

-chris

Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 19.03.2019 um 22:08 schrieb Christopher Schultz:
> Tilman,
>
> Right, it basically does the same thing as I have above, but for a
> bunch of increasingly-widening substrings, and it uses exceptions for
> flow control. Yuck.
>
> I'd have to look more into what PDFont.encode does, but I'm guessing
> that it wouldn't be too hard to build methods into the PDFont class
> that look something like this:
>
> /**
>   * Returns true if this PDFont can render the whole string.
>   */
> public boolean canEncode(String s);
>
> /**
>   * Returns the longest String that can be successfully encoded by this
>   * PDFont, beginning at the beginning of {s}. If the whole String {s}
>   * is encodable, then {s} will be returned. If only a part of {s}
>   * is encodable, then the return value of this method will be such that:
>   *
>   *       s.startsWith(getLongestEncodablePrefix(s)) == true
>   *
>   *
>   * If the first character of the string is not encodable in this PDFont,
>   * an empty string (or null?) will be returned.
>   */
> public String getLongestEncodablePrefix(String s);


That would just push what you called "Yuck" further downwards, or we
would have to maintain code twice, one for checking whether something
can be encoded, and one for actually doing it. And this for all the 6,
maybe 7 font types.

Instead of going forward with your project with the working code 
provided, you're arguing about design issues.

Tilman


>
> WDYT?
>
> If this must be implemented initially by using exceptions for
> flow-control, so be it. But theoretically, it can be improved in the
> future by performing faster checks... possibly by each type of PDFont
> subclass in a different way.
>
> -chris

Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Tilman,

On 3/19/19 16:23, Tilman Hausherr wrote:
> No, that is why I created the EmbeddedMultipleFonts.java example
> which I mentioned earlier in the thread. That one can switch
> within strings.

Right, it basically does the same thing as I have above, but for a
bunch of increasingly-widening substrings, and it uses exceptions for
flow control. Yuck.

I'd have to look more into what PDFont.encode does, but I'm guessing
that it wouldn't be too hard to build methods into the PDFont class
that look something like this:

/**
 * Returns true if this PDFont can render the whole string.
 */
public boolean canEncode(String s);

/**
 * Returns the longest String that can be successfully encoded by this
 * PDFont, beginning at the beginning of {s}. If the whole String {s}
 * is encodable, then {s} will be returned. If only a part of {s}
 * is encodable, then the return value of this method will be such that:
 *
 *       s.startsWith(getLongestEncodablePrefix(s)) == true
 *
 *
 * If the first character of the string is not encodable in this PDFont,
 * an empty string (or null?) will be returned.
 */
public String getLongestEncodablePrefix(String s);
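
The contract described in that Javadoc could be sketched outside PDFBox as
follows. This is only an illustration: EncodablePrefix is a hypothetical
class, and the per-code-point canEncode predicate is an assumption supplied
by the caller (it could, for instance, be backed by a try/catch around an
encode call). It checks each code point independently, so it would not
account for fonts that treat character sequences specially.

```java
import java.util.function.IntPredicate;

/** Finds the longest leading part of a string that a font can encode. */
final class EncodablePrefix {
    /**
     * Returns the longest prefix of s whose code points all pass canEncode.
     * Returns the empty string when the first code point is rejected.
     */
    static String longestEncodablePrefix(String s, IntPredicate canEncode) {
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            if (!canEncode.test(codePoint)) {
                break;
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(codePoint);
        }
        return s.substring(0, i);
    }
}
```

Because the result is always taken from the start of s, the property
s.startsWith(longestEncodablePrefix(s, p)) holds for any predicate p.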

WDYT?

If this must be implemented initially by using exceptions for
flow-control, so be it. But theoretically, it can be improved in the
future by performing faster checks... possibly by each type of PDFont
subclass in a different way.

-chris

Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

On 19.03.2019 at 19:45, Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Tilman,
>
> So I'm starting to look toward making my code better now that it's
> actually working. Right now, my code looks like this:
>
>          if(!isAnsiEncoding(strippedText)) {
>              font = getFullUnicodeFont();
>          }
>
> Where one font simply replaces the other for strings that aren't
> available in the built-in font(s).
>
> I'd like to support emoji and stuff like that. I can find a font (or
> fonts) for that, but I think the only way I can do that with the
> existing API is something like this:
>
> PDFont[] fonts = new PDFont[] { builtIn, arialUnicode, emoji };
>
> for(PDFont font : fonts) {
>     try {
>         page.setFont(font, fontSize);
>         page.showText(text);
>         break; // Success: stop trying further fonts
>     } catch (IllegalArgumentException iae) {
>         // Try the next font
>     }
> }
>
> That will "work" but it will not work if, for example, I need to print
> text that includes both Chinese characters (from arialUnicode font)
> and also emoji (from the hypothetical "emoji" font).
>
> Is there any way to tell PDFBox to "pick the right font (from some
> list) for each character"?


No, that is why I created the EmbeddedMultipleFonts.java example which I 
mentioned earlier in the thread. That one can switch within strings.
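
The idea of switching fonts within a string can be sketched independently of PDFBox. The following is not the code of EmbeddedMultipleFonts.java; it is a rough illustration that assigns each code point to the first font in a list that claims to support it and merges neighbours into runs. FontRuns, Run, and the IntPredicate checks are hypothetical stand-ins. In a real program each run would presumably be written with setFont followed by showText.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class FontRuns {
    /** A piece of text together with the name of the font chosen for it. */
    static final class Run {
        final String fontName;
        final String text;

        Run(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /**
     * Splits text into runs of consecutive code points that share a font.
     * Fonts are tried in map iteration order, so pass a LinkedHashMap with
     * the preferred font first. Throws IllegalArgumentException if no font
     * supports some code point.
     */
    static List<Run> split(String text, Map<String, IntPredicate> fonts) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentFont = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String font = pick(cp, fonts);
            if (!font.equals(currentFont) && current.length() > 0) {
                // The font changes here: close the run collected so far.
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = font;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(new Run(currentFont, current.toString()));
        }
        return runs;
    }

    /** Returns the name of the first font whose check accepts the code point. */
    private static String pick(int cp, Map<String, IntPredicate> fonts) {
        for (Map.Entry<String, IntPredicate> e : fonts.entrySet()) {
            if (e.getValue().test(cp)) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException(String.format("No font for U+%04X", cp));
    }
}
```

Working per code point rather than per char keeps surrogate pairs (which emoji need) together in one run.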

Tilman



>
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
>
> iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlyRONIACgkQHPApP6U8
> pFhu3A//dObTICq7o17gNERfJKQg6dL4nFt8eHXTrw/NZkrSzMtiyYttil+o8a5o
> y3bPDQ+Nvo2FofQBFCfq480mZh1Vo8MpVNKTitUISR/14zzNPSTNa+K08bfMMYhA
> 8El2EgGAv/v/xtn7xFLNowOjbq7r3Hap1wmYpwLVM1aqFYL4wS6QNwlkmIsOqocs
> JeeQ247g/KZHm4nJ9Z+b5Dd8vS/DpoOUzs9Yyt9APNHPRAjirevq37ALf46gowDj
> GHlIGLzjNDLDLUn6sCFES2SSScHt8und/RW6K5cEJsFmtc22cFZ9RpcpeRg4BkJh
> /VPDs8Iq1KzMUXWjlJTq5bWsbE8IMCtgSkYZt0Fl9FJOGrg9aIa6SjEHxZ3KsBht
> RHquj3vblGYrrn22t+G+oelIm94iiWfwsIf/wmOke2fcv83lEX5xVMtTKLB+uCQo
> 4wwMqgkuTQiMS8KH5BlR5WCMrmGhRq4fD3gZ1Sdt4TJXiKJuUOss5sQTdDgLIyvT
> jL29R79pCdnp1v90rxM2sFR3CPr/fjUZOcF1+vYKhXwyaYSFboaxCUwtFNoA+aLc
> mztEIRurYq6MParoIrELyGaqVnmOD/ElcPiRdbNSWkfa8xRcAjHqeFCjZe6qrTOD
> nkbAzhOG4Ty0hyI/v0zaaGvJ1lS40zzaCp0hHxDcd1td3JnUzs4=
> =paYu
> -----END PGP SIGNATURE-----
>


Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Tilman,

So I'm starting to look toward making my code better now that it's
actually working. Right now, my code looks like this:

        if(!isAnsiEncoding(strippedText)) {
            font = getFullUnicodeFont();
        }

Where one font simply replaces the other for strings that aren't
available in the built-in font(s).

I'd like to support emoji and stuff like that. I can find a font (or
fonts) for that, but I think the only way I can do that with the
existing API is something like this:

PDFont[] fonts = new PDFont[] { builtIn, arialUnicode, emoji };

for(PDFont font : fonts) {
    try {
        page.setFont(font, fontSize);
        page.showText(text);
        break; // Success: stop trying further fonts
    } catch (IllegalArgumentException iae) {
        // Try the next font
    }
}

That will "work" but it will not work if, for example, I need to print
text that includes both Chinese characters (from arialUnicode font)
and also emoji (from the hypothetical "emoji" font).

Is there any way to tell PDFBox to "pick the right font (from some
list) for each character"?

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlyRONIACgkQHPApP6U8
pFhu3A//dObTICq7o17gNERfJKQg6dL4nFt8eHXTrw/NZkrSzMtiyYttil+o8a5o
y3bPDQ+Nvo2FofQBFCfq480mZh1Vo8MpVNKTitUISR/14zzNPSTNa+K08bfMMYhA
8El2EgGAv/v/xtn7xFLNowOjbq7r3Hap1wmYpwLVM1aqFYL4wS6QNwlkmIsOqocs
JeeQ247g/KZHm4nJ9Z+b5Dd8vS/DpoOUzs9Yyt9APNHPRAjirevq37ALf46gowDj
GHlIGLzjNDLDLUn6sCFES2SSScHt8und/RW6K5cEJsFmtc22cFZ9RpcpeRg4BkJh
/VPDs8Iq1KzMUXWjlJTq5bWsbE8IMCtgSkYZt0Fl9FJOGrg9aIa6SjEHxZ3KsBht
RHquj3vblGYrrn22t+G+oelIm94iiWfwsIf/wmOke2fcv83lEX5xVMtTKLB+uCQo
4wwMqgkuTQiMS8KH5BlR5WCMrmGhRq4fD3gZ1Sdt4TJXiKJuUOss5sQTdDgLIyvT
jL29R79pCdnp1v90rxM2sFR3CPr/fjUZOcF1+vYKhXwyaYSFboaxCUwtFNoA+aLc
mztEIRurYq6MParoIrELyGaqVnmOD/ElcPiRdbNSWkfa8xRcAjHqeFCjZe6qrTOD
nkbAzhOG4Ty0hyI/v0zaaGvJ1lS40zzaCp0hHxDcd1td3JnUzs4=
=paYu
-----END PGP SIGNATURE-----

Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

On 04.03.2019 at 20:44, Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Tilman,
>
> On 3/3/19 08:48, Tilman Hausherr wrote:
>>> I have no idea. The information about PDFBox seems to be mostly
>>> in example programs and not web-based documentation. Searching
>>> e.g. Google for "how to use FontBox with PDFBox" generally comes
>>> up with references into the Javadoc for "uses of FontBox
>>> interface".
>>>
>>> The Javadoc does not describe what FontBox is and none of the
>>> classes or subclasses in those related packages really have any
>>> documentation worth reading. Each class "foo" is described as
>>> "being a foo" and each "getBar" method is described as "gets the
>>> bar for the foo".
>>>
>>> So... discoverability of features is pretty much nil here.
>>>
>>> I'm quite happy with the responses I get on this mailing list,
>>> but it's nearly impossible to discover on my own what is
>>> possible, here. I shouldn't have to get you guys to tell me how
>>> to use the software... you have better things to do (like
>>> continue to write great software).
>>>
>>> Is there a good example of using FontBox with PDFBox in order to
>>> subset a font?
>> Yes, the EmbeddedFonts.java example.
> I don't see any use of FontBox in the EmbeddedFonts.java example. Am I
> missing something?


Sorry, I meant under the hood it is using fontbox. 
EmbeddedVerticalFonts.java uses fontbox directly but only to open the 
font and access some special features.

You don't need to bother about subsetting yourself. PDFBox does this for 
you. If you want to know how it is done, see TTFSubsetter.java and its 
usages, i.e. all implementations of Subsetter.java (TrueTypeEmbedder, 
PDTrueTypeFontEmbedder, PDCIDFontType2Embedder).

> :)
>
> It's less of a presence of useless documentation and more of a lack of
> existing documentation. I can file some tickets if you think it would
> be helpful. I also don't mind writing documentation and/or tutorials
> for the project.

Try to start with something small. I try to concentrate on javadoc 
improvements and having working examples. A tutorial, to be complete, 
would also need to introduce people to PDF concepts. An example has the 
advantage that it works immediately even if they don't know anything 
about the PDF specification and PDF operators and content streams and 
the PD and COS model - people just need to adjust the example to 
their needs.


>
>> The subset thing is done by PDFBox without you having to bother
>> about it. It's "not subsetting" that would require more parameters.
>> So you need only this:
>>
>> PDType0Font font = PDType0Font.load(document,
>>         new File("c:/windows/fonts/arial.ttf"));
>> stream.setFont(font, 12);
>> stream.showText("...");
> Okay, that's exactly what we are doing (well... we are loading the
> font via the ClassLoader, but ...). And it's working. I was just a
> little worried about the ballooning file size. I realize there is
> little to be done about that at this stage.


It will grow if you open the same font several times for one PDF file. 
Of course it will also grow if you use many different glyphs. But 
seriously, protesting because a file grows from 1 KB to 18 KB? Your file 
is only 18 KB! That is what counts. That is still small. Most PDF 
invoices I get are > 100KB.
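
One way to avoid opening the same font several times for one PDF is to keep a small cache per document. This is only a sketch under assumptions: FontCache is a hypothetical helper, the type parameter F stands for the loaded font object (for example the result of PDType0Font.load), and the loader function is supplied by the caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-document cache: each font resource is loaded at most once. */
final class FontCache<F> {
    private final Map<String, F> loaded = new HashMap<>();
    private final Function<String, F> loader;

    FontCache(Function<String, F> loader) {
        this.loader = loader;
    }

    /** Returns the cached font for this resource name, loading it on first use. */
    F get(String resourceName) {
        return loaded.computeIfAbsent(resourceName, loader);
    }

    /** Number of distinct fonts loaded so far. */
    int size() {
        return loaded.size();
    }
}
```

A loader that calls PDType0Font.load would have to wrap the checked IOException, since Function cannot throw it. The cache is meant to live as long as one document, because the load call takes the document as an argument.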


>
> At this point, I am basically doing this:
>
> [ When adding text to the document ]
> - - If the text contains anything outside of the ANSI encoding
>   - then replace the usual (default) font with the ARIALUNI.TTF
>
> It operates on a per-text-string basis, so it should only change the
> font for a single piece of text that requires it. I'm starting to
> think that I should not bother scanning the text and instead use the
> IllegalArgumentException as flow-control -- which I still don't like.
> But it means that my code will not spend a ton of time repeating
> checks that PDFBox will end up doing, anyway.
>
> I'm a little worried about what I will do the next time I have an
> issue like this -- where the ARIALUNI.TTF font doesn't include some
> character that I need... since there's no way to probe a font for
> support for a code point, I can't map code-points to fonts in a
> scalable way. It will just be trial-and-error which is no fun.


Know your clients. Have enough fonts for different languages. Use the 
multifont example I mentioned (EmbeddedMultipleFonts.java).

I assume there is a way to ask a font whether a codepoint is 
supported... You could use TrueTypeFont.hasGlyph(name). The name is the 
postscript name of that glyph. But I don't know if this works for 
characters not in the adobe glyphlist.
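
If probing the font itself turns out to be awkward, a coarse mapping from Unicode scripts to fonts can be built with the JDK alone. The sketch below uses Character.UnicodeScript, which is standard Java. The class name ScriptFontChooser and the font names are invented for illustration, and a script-level mapping does not guarantee that a given font contains every glyph of that script.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper that maps Unicode scripts to font names. */
final class ScriptFontChooser {
    private final Map<Character.UnicodeScript, String> byScript =
            new EnumMap<>(Character.UnicodeScript.class);
    private final String fallback;

    ScriptFontChooser(String fallback) {
        this.fallback = fallback;
    }

    /** Registers the font to use for one script; returns this for chaining. */
    ScriptFontChooser register(Character.UnicodeScript script, String fontName) {
        byScript.put(script, fontName);
        return this;
    }

    /** Returns the font registered for the code point's script, or the fallback. */
    String fontFor(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return byScript.getOrDefault(script, fallback);
    }
}
```

For example, after registering HAN for one font, fontFor(0x4E2D) (the character 中) returns that font, while Latin letters fall through to the fallback.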


Tilman


>
> It also means that I need to have some kind of set of fonts that we
> just round-robin through, hoping we get a hit and we can continue...
> otherwise we just have to fail (like we do now).
>
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
>
> iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlx9gDYACgkQHPApP6U8
> pFia0w/+LSFIJCLtol+WZDMpcjTxI1Y4ulUFmRJxd+ZdGzbCrKss2R3p+J6VGZ0w
> SZWAUQqg48FoVu4kh3fp4j9mz9eqprF9rmZiEPqGJKtsUPnpMTd3SA6Xt2eucY3O
> VMOEbsy66/wC3DwgIgQdrrDfuRWsvmLkE6WyvkJpf1+sDIgFkSoD57y3YpHQdB4/
> o6+WXg1FSVjQAiND/XYAGZUHmV2o5JGFJVJJNlnmC6m11j/0zZvv4ZS1v3NX4DS1
> n9cwHtTEUxcz73AGzUo9A0QLfsPgEMEF8akbaLfA4UekZ0lZLCFXA36aP62KaI6b
> ICo1/qF7eEOC1XpdCZS2JWpjMQn83q2kvuIooTEyHXjOT8t27f0+455e3PgYuLkh
> kV9xMutmkJxXKv5VO3ohTmDWydQiwt/90M9ToTKonGeYWXTEEWzHpHr6BD95/2rZ
> +yAbY3S0vTb1J0uQmlDaK6dd1pU+SSMxIV6Gi1tYi1kMVboiiQAMxJ9eqEhjt21+
> W3x4oGPLUoJ6q1TSTh0BOnXVnEUeci/Srbp+GWXvhmXtVC5H9V6dggb94yaKI3nC
> KLW+87OYaU+Pd4GQNMI+2KipGAbeQ/8OhHEq63cFoKLzhKk/V/50w3Bo9/CLGyZ3
> W0E7lAZWV5cnu/AoKHC9KdSIPf+Qn6c//CtDmyWbjAr8g1yOzZc=
> =TScO
> -----END PGP SIGNATURE-----
>


Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Tilman,

On 3/3/19 08:48, Tilman Hausherr wrote:
>> I have no idea. The information about PDFBox seems to be mostly
>> in example programs and not web-based documentation. Searching
>> e.g. Google for "how to use FontBox with PDFBox" generally comes
>> up with references into the Javadoc for "uses of FontBox
>> interface".
>> 
>> The Javadoc does not describe what FontBox is and none of the
>> classes or subclasses in those related packages really have any
>> documentation worth reading. Each class "foo" is described as
>> "being a foo" and each "getBar" method is described as "gets the
>> bar for the foo".
>> 
>> So... discoverability of features is pretty much nil here.
>> 
>> I'm quite happy with the responses I get on this mailing list,
>> but it's nearly impossible to discover on my own what is
>> possible, here. I shouldn't have to get you guys to tell me how
>> to use the software... you have better things to do (like
>> continue to write great software).
>> 
>> Is there a good example of using FontBox with PDFBox in order to 
>> subset a font?
> 
> Yes, the EmbeddedFonts.java example.

I don't see any use of FontBox in the EmbeddedFonts.java example. Am I
missing something?

> We are a small team and don't have the time to write tutorials.
> There are many working examples and also many answers on
> stackoverflow.

Understood.

> You don't need fontbox except for advanced things, e.g. reusing a
> font for several files. For normal use cases, fontbox remains under
> the hood.
> 
> If you think some class documentation is useless, name it, and I'll
> see if it can be improved.

:)

It's less of a presence of useless documentation and more of a lack of
existing documentation. I can file some tickets if you think it would
be helpful. I also don't mind writing documentation and/or tutorials
for the project.

> The subset thing is done by PDFBox without you having to bother
> about it. It's "not subsetting" that would require more parameters.
> So you need only this:
> 
> PDType0Font font = PDType0Font.load(document,
>         new File("c:/windows/fonts/arial.ttf"));
> stream.setFont(font, 12);
> stream.showText("...");

Okay, that's exactly what we are doing (well... we are loading the
font via the ClassLoader, but ...). And it's working. I was just a
little worried about the ballooning file size. I realize there is
little to be done about that at this stage.

At this point, I am basically doing this:

[ When adding text to the document ]
- - If the text contains anything outside of the ANSI encoding
 - then replace the usual (default) font with the ARIALUNI.TTF

It operates on a per-text-string basis, so it should only change the
font for a single piece of text that requires it. I'm starting to
think that I should not bother scanning the text and instead use the
IllegalArgumentException as flow-control -- which I still don't like.
But it means that my code will not spend a ton of time repeating
checks that PDFBox will end up doing, anyway.

I'm a little worried about what I will do the next time I have an
issue like this -- where the ARIALUNI.TTF font doesn't include some
character that I need... since there's no way to probe a font for
support for a code point, I can't map code-points to fonts in a
scalable way. It will just be trial-and-error which is no fun.

It also means that I need to have some kind of set of fonts that we
just round-robin through, hoping we get a hit and we can continue...
otherwise we just have to fail (like we do now).

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlx9gDYACgkQHPApP6U8
pFia0w/+LSFIJCLtol+WZDMpcjTxI1Y4ulUFmRJxd+ZdGzbCrKss2R3p+J6VGZ0w
SZWAUQqg48FoVu4kh3fp4j9mz9eqprF9rmZiEPqGJKtsUPnpMTd3SA6Xt2eucY3O
VMOEbsy66/wC3DwgIgQdrrDfuRWsvmLkE6WyvkJpf1+sDIgFkSoD57y3YpHQdB4/
o6+WXg1FSVjQAiND/XYAGZUHmV2o5JGFJVJJNlnmC6m11j/0zZvv4ZS1v3NX4DS1
n9cwHtTEUxcz73AGzUo9A0QLfsPgEMEF8akbaLfA4UekZ0lZLCFXA36aP62KaI6b
ICo1/qF7eEOC1XpdCZS2JWpjMQn83q2kvuIooTEyHXjOT8t27f0+455e3PgYuLkh
kV9xMutmkJxXKv5VO3ohTmDWydQiwt/90M9ToTKonGeYWXTEEWzHpHr6BD95/2rZ
+yAbY3S0vTb1J0uQmlDaK6dd1pU+SSMxIV6Gi1tYi1kMVboiiQAMxJ9eqEhjt21+
W3x4oGPLUoJ6q1TSTh0BOnXVnEUeci/Srbp+GWXvhmXtVC5H9V6dggb94yaKI3nC
KLW+87OYaU+Pd4GQNMI+2KipGAbeQ/8OhHEq63cFoKLzhKk/V/50w3Bo9/CLGyZ3
W0E7lAZWV5cnu/AoKHC9KdSIPf+Qn6c//CtDmyWbjAr8g1yOzZc=
=TScO
-----END PGP SIGNATURE-----

Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

> I have no idea. The information about PDFBox seems to be mostly in
> example programs and not web-based documentation. Searching e.g.
> Google for "how to use FontBox with PDFBox" generally comes up with
> references into the Javadoc for "uses of FontBox interface".
>
> The Javadoc does not describe what FontBox is and none of the classes
> or subclasses in those related packages really have any documentation
> worth reading. Each class "foo" is described as "being a foo" and each
> "getBar" method is described as "gets the bar for the foo".
>
> So... discoverability of features is pretty much nil here.
>
> I'm quite happy with the responses I get on this mailing list, but
> it's nearly impossible to discover on my own what is possible, here.
> I shouldn't have to get you guys to tell me how to use the software...
> you have better things to do (like continue to write great software).
>
> Is there a good example of using FontBox with PDFBox in order to
> subset a font?


Yes, the EmbeddedFonts.java example. We are a small team and don't have 
the time to write tutorials. There are many working examples and also 
many answers on stackoverflow.

You don't need fontbox except for advanced things, e.g. reusing a font for 
several files. For normal use cases, fontbox remains under the hood.

If you think some class documentation is useless, name it, and I'll see 
if it can be improved.

The subset thing is done by PDFBox without you having to bother about 
it. It's "not subsetting" that would require more parameters. So you 
need only this:

PDType0Font font = PDType0Font.load(document, new 
File("c:/windows/fonts/arial.ttf"));
stream.setFont(font, 12);
stream.showText("...");

Tilman


Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

John,

On 3/2/19 10:16, John Logan wrote:
> Christopher, is the font that you don’t want to embed a Type 1
> font, or a TrueType font?

I'm using PDType0Font.load to load the font, but the file is TT:

ARIALUNI.TTF: TrueType Font data, digitally signed, 20 tables, 1st
"DSIG", name offset 0x161c2f0

> If the latter, could you use Fontbox to subset the font and keep
> the file size small?

I have no idea. The information about PDFBox seems to be mostly in
example programs and not web-based documentation. Searching e.g.
Google for "how to use FontBox with PDFBox" generally comes up with
references into the Javadoc for "uses of FontBox interface".

The Javadoc does not describe what FontBox is and none of the classes
or subclasses in those related packages really have any documentation
worth reading. Each class "foo" is described as "being a foo" and each
"getBar" method is described as "gets the bar for the foo".

So... discoverability of features is pretty much nil here.

I'm quite happy with the responses I get on this mailing list, but
it's nearly impossible to discover on my own what is possible, here.
I shouldn't have to get you guys to tell me how to use the software...
you have better things to do (like continue to write great software).

Is there a good example of using FontBox with PDFBox in order to
subset a font?

- -chris

>> On Mar 2, 2019, at 7:00 AM, Tilman Hausherr
>> <TH...@t-online.de> wrote:
>> 
>> On 02.03.2019 at 15:54, Christopher Schultz wrote:
>>> Is there a good way to probe text to determine whether or not
>>> an alternate font will be necessary and only load/bundle it
>>> then?
>> 
>> From the new EmbeddedMultipleFonts.java example (in the source
>> code download):
>> 
>> 
>> boolean isWinAnsiEncoding(int unicode)
>> {
>>     String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>>     if (".notdef".equals(name))
>>     {
>>         return false;
>>     }
>>     return WinAnsiEncoding.INSTANCE.contains(name);
>> }
>> 
>> 
>> When that one returns true, you can use the built-in fonts.
>> 
>> Tilman
>> 
>> 
>> 
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlx72L8ACgkQHPApP6U8
pFieHQ/8CGlWfRwCNzFdZOLIz/bgquqSsVlGMSYwdMv3+Ytl5WJ8vvJj1az/YNVE
yyIXVKWWVa1aQiEMX+wEXZIhcLX1YROireFYkC6IwQaCjlfLtPTPopjwehVTfnN7
M5Fk23Rfge+Eths9alRm82hLgoKnYO70bYWfAWeYXokjPUXcQokfyG7N3CkWYaZa
Ljt8fihDGbk266v7wPwbiRef58F3NW1EfSFV4J8qFr/bOiLZsRXGY2UXe4/k6Fxn
qGSMqnV76CwWWXSYp4saKG0kAija37huAooYhksWAOO12WPJbOtCVD3C6veS/R8M
RFXOb9z9uT/yratN7KGDxuWKT28YXaoFPzJfLwx1ZOiDZCK3E39xG8d7/dqiAFrb
Edc4mBxK0wz9Ew6B1zReOG3d3kP7ksYEUsMwtLltfz4LSj17dzTuWaMCV5EQ0FRx
8oFm7xiPXBNwA8tNj/+US81jGV2u2pwxcKUi8LEygJzp7qjw5RsIQMrXUq450NWE
LKIPqUE3I8iIpCqST1IX6qMSKgUpYyKi9nTxjMXIjNL6j9kA91fzsZLluBRm2vCs
+jAgcVRImSrQ2wa0ZFtTEf3xQpEorkELgN1KhVkVRLllkisVmdqY026z7KfWwwP7
YsKBs6Si/ZOrDQO5gxlzXZIcE8AO54X7vh5V+IfKVsN+n6fwW9E=
=jXC9
-----END PGP SIGNATURE-----

Re: Choosing a font for non-ASCII characters

Posted by John Logan <jo...@texture.com>.

Christopher, is the font that you don’t want to embed a Type 1 font, or a TrueType font?

If the latter, could you use Fontbox to subset the font and keep the file size small?

> On Mar 2, 2019, at 7:00 AM, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 02.03.2019 at 15:54, Christopher Schultz wrote:
>> Is there a good way to probe
>> text to determine whether or not an alternate font will be necessary
>> and only load/bundle it then?
> 
> From the new EmbeddedMultipleFonts.java example (in the source code download):
> 
> 
>     boolean isWinAnsiEncoding(int unicode)
>     {
>         String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>         if (".notdef".equals(name))
>         {
>             return false;
>         }
>         return WinAnsiEncoding.INSTANCE.contains(name);
>     }
> 
> 
> When that one returns true, you can use the built-in fonts.
> 
> Tilman
> 
> 
> 



Re: Choosing a font for non-ASCII characters

Posted by Tilman Hausherr <TH...@t-online.de>.

On 02.03.2019 at 15:54, Christopher Schultz wrote:
> Is there a good way to probe
> text to determine whether or not an alternate font will be necessary
> and only load/bundle it then?

 From the new EmbeddedMultipleFonts.java example (in the source code 
download):


     boolean isWinAnsiEncoding(int unicode)
     {
         String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
         if (".notdef".equals(name))
         {
             return false;
         }
         return WinAnsiEncoding.INSTANCE.contains(name);
     }


When that one returns true, you can use the built-in fonts.
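
The method above tests a single code point, so a whole string has to be walked code point by code point. A small sketch of that loop follows. To keep it runnable without PDFBox on the classpath, the per-code-point check is passed in as an IntPredicate, and the stand-in used here is the JDK's windows-1252 charset, which is only an approximation of WinAnsiEncoding (an assumption of this sketch). In real code the isWinAnsiEncoding method above would be passed instead. TextProbe is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.function.IntPredicate;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class TextProbe {
    private TextProbe() {}

    /** Returns true if every code point of text passes the given check. */
    static boolean allCodePointsMatch(String text, IntPredicate check) {
        return text.codePoints().allMatch(check);
    }

    /**
     * Stand-in check (an approximation, not WinAnsiEncoding itself):
     * can the JDK's windows-1252 charset encode this code point?
     */
    static IntPredicate windows1252() {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return cp -> encoder.canEncode(new String(Character.toChars(cp)));
    }
}
```

When the check passes for the whole string, the built-in font can be kept; otherwise the text needs an embedded font for at least part of it.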

Tilman



Re: Choosing a font for non-ASCII characters

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Andreas,

On 1/31/19 01:27, Andreas Lehmkuehler wrote:
> The standard PDF fonts (PDType1Font.HELVETICA et al.) don't
> support anything other than (limited) Latin-1. You have to use
> something else.
> 
> Have a look at the HelloWorldTTF example [1]. It shows how to embed
> a TrueType font. You have to choose a suitable font from your OS
> or something like the Noto fonts from Google.
> 
> W.r.t. font embedding: it's always a good idea to embed all
> resources which are needed to render a PDF. PDFBox reduces the
> amount of space as it limits the embedded font to the used
> characters.

Thanks for the pointers. I'm finally getting around to doing something
about this. I used "Arial Unicode" as referenced in a quickie online
tutorial[1] and what I'm finding is that:

1. The Chinese characters render correctly (yay!)
2. My English-only file has gone from ~1k to ~18k

This test file was the simplest I could muster so it's really an
unfair comparison at this point.

But it's clear that the file will get bigger (of course) by adding the
font.

I'd like to avoid bundling the font unless it's necessary. For several
months, we've been able to get away with the standard PDF default
fonts (which, presumably, the PDF spec requires all clients to provide
which is why the files can be so small). Is there a good way to probe
text to determine whether or not an alternate font will be necessary
and only load/bundle it then?

Thanks,
- -chris

[1] http://www.kscodes.com/java/write-chinese-pdf-using-apache-pdfbox/

> On 31.01.19 at 02:56, Christopher Schultz wrote:
> 
> Hello,
> 
> We are using PDFBox to generate PDFs in a very simple way and only 
> including fonts available from the PDType1Font class (e.g. 
> PDType1Font.HELVETICA). The PDFs we are generating are really only 
> including a few title/subtitles, text, and bulleted/numbered
> lists.
> 
> Everything is fine when we use what is probably in the standard
> Latin alphabet, and we've had some troubles with special characters
> that don't fit in there such as ≥ and ≤. We've dealt with that by
> simply replacing "≤" with "<=" and so on, but we're starting to use
> languages that don't use Latin script and so we can no longer
> replace our way out of the problem.
> 
> For example, I need to be able to put Chinese characters into a PDF
> we generate. So let's take the text "中國" which is just the word
> "China" in Traditional Chinese script.
> 
> First, how can I find out that the character isn't going to fit
> into the font that I'm currently using? Should I do it for every
> character we try to put into the page, or should we just catch
> exceptions when we try to write the text to the page and then scan
> at that point? I'm trying to avoid writing hideously inefficient
> code to handle these situations.
> 
> Second, once I know that I need to choose another font... how do I 
> know which font to choose? Should I keep a mapping of Unicode code 
> point ranges and the best fonts to use for them?
> 
> Finally, what fonts are actually available to PDFBox? How do I add
> new ones? I have a lot of control over the environment and I get to
> see failing conversions and intervene, so some trial and error is
> okay for each new situation.
> 
> The recipients of our PDFs are file-size sensitive, so I'd only
> want to include (bundle) a font in a PDF if it was absolutely
> necessary to include the font itself. If we can get away with
> including a *reference* to the font in the PDF and telling these
> recipients "sorry, if you want to read the Chinese PDFs we send,
> you'd better make sure you have font X installed" then that's okay
> with me, too.
> 
> What suggestions do people have for doing all of the above?
> 
> Thanks, -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlx6mRwACgkQHPApP6U8
pFjWYQ/+JqfHbkkJ4ML+uxduY4PIJqY7u+FC1lsbVvbVjIhi1rLCQRuNDUWnpkmz
bSfwCoDOevamegryFFxH/I4Ok+v8TXmBUEnAeEOFtHGlWHDuNXcijxmlFRKdpjIi
MFzqv8t+4+YY6dS4KyHr4+fhj57sSqRkGVrKAYANonx3z/nEn/X7PqOnY1seDrEJ
QGB/09y36+58E6TI+65resE181nvYFcw5kqchFWIjziwH654gldLQCojZ15GS5+/
PylDx5f6n/pxPYJLX940zEDjfqR4FCQryuzo1Yf3xM96c1IMYJbViv/LWrz+lQnc
+7PPK99oVhRdQKQ90HOsFA+7WfyB6IXv/uOdFyXSjWTP7NNQ4v5wSp+nULrRsSRH
uc3FL9N55ujdHb5uTQW5tl5kENfIXdgh5X0XtI/3TQGnmFJRbsx/py/Elpno7HVO
IwbwWTXnefYGvjsP1zU1YjCS4WBuekE/3C5Mn5zJaQFxRNrNCXmAeYBLskA6gitk
u5A+wl3jPlGrJe5Vvvgr6CJJl9p67XldiJslUQ/Gekjqd0VA572zeiOhj35Qkh1D
Eh43WPn2KR2TGYtmU1WyM4fyKIN7/9ReqTv53hV8t5P/ItEjlY9zAABnMDsK+eXr
iRK/Q8LbMpLz3osZQuccmCaSfTufbGr444lngSZLRhFs7Uoihl4=
=p8mE
-----END PGP SIGNATURE-----

Re: Choosing a font for non-ASCII characters

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

The standard PDF fonts (PDType1Font.HELVETICA et al.) don't support anything 
other than (limited) Latin-1. You have to use something else.

Have a look at the HelloWorldTTF example [1]. It shows how to embed a TrueType 
font. You have to choose a suitable font from your OS or something like the Noto 
fonts from Google.

W.r.t. font embedding: it's always a good idea to embed all resources which are 
needed to render a PDF. PDFBox reduces the amount of space as it limits the 
embedded font to the used characters.

HTH,
Andreas

[1] 
http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/HelloWorldTTF.java

On 31.01.19 at 02:56, Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> Hello,
> 
> We are using PDFBox to generate PDFs in a very simple way and only
> including fonts available from the PDType1Font class (e.g.
> PDType1Font.HELVETICA). The PDFs we are generating are really only
> including a few title/subtitles, text, and bulleted/numbered lists.
> 
> Everything is fine when we use what is probably in the standard Latin
> alphabet, and we've had some troubles with special characters that
> don't fit in there such as ≥ and ≤. We've dealt with that by simply
> replacing "≤" with "<=" and so on, but we're starting to use languages
> that don't use Latin script and so we can no longer replace our way
> out of the problem.
> 
> For example, I need to be able to put Chinese characters into a PDF we
> generate. So let's take the text "中國" which is just the word "China"
> in Traditional Chinese script.
> 
> First, how can I find out that the character isn't going to fit into
> the font that I'm currently using? Should I do it for every character
> we try to put into the page, or should we just catch exceptions when
> we try to write the text to the page and then scan at that point? I'm
> trying to avoid writing hideously inefficient code to handle these
> situations.
> 
> Second, once I know that I need to choose another font... how do I
> know which font to choose? Should I keep a mapping of Unicode code
> point ranges and the best fonts to use for them?
> 
> Finally, what fonts are actually available to PDFBox? How do I add new
> ones? I have a lot of control over the environment and I get to see
> failing conversions and intervene, so some trial and error is okay for
> each new situation.
> 
> The recipients of our PDFs are file-size sensitive, so I'd only want
> to include (bundle) a font in a PDF if it was absolutely necessary to
> include the font itself. If we can get away with including a
> *reference* to the font in the PDF and telling these recipients
> "sorry, if you want to read the Chinese PDFs we send, you'd better
> make sure you have font X installed" then that's okay with me, too.
> 
> What suggestions to people have for doing all of the above?
> 
> Thanks,
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
> 
> iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlxSVeMACgkQHPApP6U8
> pFgQew/8CS1YmJs27QrD+WGV/Zcn2RAeG/ZVs5w3huMwKLY8NfXQ4Vdp3o+s+B7u
> 2wn9m2LJVXuWT2dfDDQzZDIfBgfqZI5sl4+hBDSos9gEVV3ddWcox1A0YSTCy5VW
> DAlDZSscEdIDyMIVz2E1dQi6/p35MrSyJ/Xom6Tbnvt3ZHAp87GHZ1rB8XXrtVZS
> itVE756hJ59o4tZJoM9cH1NH1w9PuLLJyrGpCsc1oTgcZTI0jXxiIC9Q4GvLbLbO
> yVdExITzTVflLAo0BRGOJkb5IF1OyVf51HHas1+DMEvtSXY5J89e1dFnyo1dFxMU
> MXJ5rKh/FQvJtC5Lf9QoQ3tV8r3qyWv0wc8FVgMcLUA9DHbx7QtcydQwoKf3poJz
> ymlOJWH2b4d5uLbSfdjr9Nof4IRNH504cwjoth3eor3Ra/SCaem2ZrTQhY6XzoF1
> vCpZChDIKzDvI7NDGbcaNvzzezNmlbdRdh3Ekwk1E/vwfrmtb4VmW7sW9PICP1o6
> 80sqydy6qIMtQNjr1EK55VIvD4+e10SwYWhcZinsByQkYZpoRjKWQ9kTNk10vvwk
> cLB8bVeLPHC7nLe4FqJe4y3+hWBfGP25O2VdnNU1sjd4lbzQhNIgCMj0n+6ziDuU
> Nh9vDuKRXEIIXHZUxrN2Td3hOw96wKHqEQ8RtxYpuGWABx4wIWw=
> =aMPi
> -----END PGP SIGNATURE-----
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 



Re: Choosing a font for non-ASCII characters

Posted by "Richter, Michael" <m....@tu-berlin.de>.

Hi,
A few weeks ago I had issues with Unicode too. I switched the font to LiberationSans, which is included in PDFBox:

// Load the bundled LiberationSans as a Type 0 (Unicode) font; 'true' embeds only a subset
PDFont font = PDType0Font.load(document,
    PDDocument.class.getResourceAsStream(
        "/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"), true);

This works for me.

I also stumbled across this, which may help you:

https://stackoverflow.com/questions/51481600/handle-many-unicode-caracters-with-pdfbox
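
Since a single font rarely covers every script, one way to handle mixed text is to split it into runs and pick, for each run, the first font from an ordered list that covers it. The following is a minimal JDK-only sketch of that idea, not code from PDFBox or from the linked page. FontRuns, Run and split are hypothetical names, and the coverage tests are stand-in predicates; in real code each predicate would have to ask the corresponding loaded font whether it has a glyph for the code point, and that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of PDFBox.
class FontRuns {
    /** A stretch of text plus the index of the first candidate font covering it (-1 if none does). */
    static final class Run {
        final int fontIndex;
        final String text;
        Run(int fontIndex, String text) { this.fontIndex = fontIndex; this.text = text; }
    }

    /** Splits text into maximal runs that share the same first matching coverage predicate. */
    static List<Run> split(String text, List<IntPredicate> coverage) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentFont = -2; // sentinel: no run started yet
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int font = -1;
            for (int f = 0; f < coverage.size(); f++) {
                if (coverage.get(f).test(cp)) { font = f; break; }
            }
            // Close the current run when the chosen font changes
            if (font != currentFont && current.length() > 0) {
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = font;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(new Run(currentFont, current.toString()));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Stand-in coverage tests (assumptions): font 0 covers Latin-1, font 1 covers Han
        IntPredicate latin1 = cp -> cp <= 0xFF;
        IntPredicate han = cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
        for (Run r : split("China: \u4e2d\u570b", Arrays.asList(latin1, han))) {
            System.out.println(r.fontIndex + " -> " + r.text);
        }
    }
}
```

Each run can then be written with its own font, and a run with index -1 signals text that none of the candidate fonts covers. Ordering the list with the preferred font first means a fallback font is used, and embedded, only when a document actually needs it.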

Michael Richter
