Posted to users@pdfbox.apache.org by jorgeeflorez <jo...@gmail.com> on 2021/05/05 16:39:53 UTC

Detecting CID fonts

Hi,
I would like to know the best way to detect whether a PDF file has CID
fonts. As far as I understand, these fonts are used in Asian texts
(Japanese, Chinese, Korean, etc.). I have the following code:

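        import org.apache.pdfbox.cos.COSName;
        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.apache.pdfbox.pdmodel.PDPage;
        import org.apache.pdfbox.pdmodel.PDResources;
        import org.apache.pdfbox.pdmodel.font.PDFont;
        import org.apache.pdfbox.pdmodel.font.PDType0Font;

        // PDFBox 2.x API; try-with-resources closes the document when done
        try (PDDocument doc = PDDocument.load(myFile))
        {
            for (int i = 0; i < doc.getNumberOfPages(); ++i)
            {
                PDPage page = doc.getPage(i);
                PDResources res = page.getResources();
                if (res == null)
                {
                    continue; // this page has no resource dictionary
                }
                for (COSName fontName : res.getFontNames())
                {
                    PDFont font = res.getFont(fontName);
                    // Composite (CID-keyed) fonts have /Subtype /Type0
                    COSName subType = font.getCOSObject().getCOSName(COSName.SUBTYPE);
                    System.out.println("CID? " + COSName.TYPE0.equals(subType));
                    System.out.println("font instanceof PDType0Font? " + (font instanceof PDType0Font));
                }
            }
        }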
Would this be the right way to do it?

I need to detect this and try to create a PDF file from the original, but
without the text.

Any pointers are appreciated.

Regards,

Jorge

Re: Detecting CID fonts

Posted by Tilman Hausherr <TH...@t-online.de>.

On 2021/05/06 at 20:52, jorgeeflorez wrote:
>> If the problem goes away by just opening and saving the PDF, then why
>> modify it?
> I cannot share the PDF file with the support team of that library. So I was
> wondering if I could share the PDF without images, so technically I would not
> be sharing the file, only the part that is causing them problems. But it
> probably won't work...
Obviously not, if the problem goes away by saving.
>
>> I've never heard of fonts being a problem... rather patterns, big images
>> or very complex vector graphics.
>>
> An internal call to Arrays.copyOf (doing God knows what) takes all
> available memory. Strange indeed.

You would have to see the rest of the stack trace. Also try to run the 
whole thing with a bigger -Xmx value.
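
A minimal sketch of both suggestions, using only the JDK. It prints the
heap limit the JVM is actually running with (to confirm that a larger -Xmx
took effect) and captures the complete stack trace if an OutOfMemoryError is
thrown. HeapProbe and the empty task in main are hypothetical names for
illustration; the real call into the other library would go where the
placeholder comment is.

```java
final class HeapProbe {
    /** Upper heap limit of this JVM in MiB; reflects -Xmx when it is set. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Runs the task; returns the full stack trace on OutOfMemoryError, else null. */
    static String traceOnOom(Runnable task) {
        try {
            task.run();
            return null;
        } catch (OutOfMemoryError e) {
            StringBuilder sb = new StringBuilder(e.toString());
            for (StackTraceElement frame : e.getStackTrace()) {
                sb.append(System.lineSeparator()).append("    at ").append(frame);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        // Placeholder: replace the empty body with the call that fails.
        String trace = traceOnOom(() -> { });
        System.out.println(trace == null ? "No OutOfMemoryError" : trace);
    }
}
```

Started with, for example, java -Xmx4g HeapProbe, the first line of output
should show a correspondingly larger limit. The frames below Arrays.copyOf
in the captured trace are the ones that show which part of the library was
growing the array.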

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Detecting CID fonts

Posted by jorgeeflorez <jo...@gmail.com>.

> If the problem goes away by just opening and saving the PDF, then why
> modify it?

I cannot share the PDF file with the support team of that library. So I was
wondering if I could share the PDF without images, so technically I would not
be sharing the file, only the part that is causing them problems. But it
probably won't work...

> I've never heard of fonts being a problem... rather patterns, big images
> or very complex vector graphics.

An internal call to Arrays.copyOf (doing God knows what) takes all
available memory. Strange indeed.


Re: Detecting CID fonts

Posted by Tilman Hausherr <TH...@t-online.de>.

Maybe the PDF is somehow broken and PDFBox repairs it. If the problem 
goes away by just opening and saving the PDF, then why modify it?

I've never heard of fonts being a problem... rather patterns, big images 
or very complex vector graphics.

Tilman


Re: Detecting CID fonts

Posted by jorgeeflorez <jo...@gmail.com>.

Hi Tilman,
thank you for your reply.

> It's more complicated because form XObjects, patterns, annotations,
> softmasks (and maybe more) can also have fonts. I also doubt that you
> can detect CJK fonts this way.

I see... I have a nasty PDF file that causes an OutOfMemoryError when used
by another library, and I reached the conclusion that it is (somehow)
because of the text and the fonts it uses...

I saw the RemoveAllText example and maybe it is what I need. I modified it so
that, instead of removing text, it does nothing, and the new PDF file seems to
have the "corruption" removed...

One last question: how could I modify the RemoveAllText example to remove
all images from the PDF file?

Thanks.

Jorge




Re: Detecting CID fonts

Posted by Tilman Hausherr <TH...@t-online.de>.

On 2021/05/05 at 18:39, jorgeeflorez wrote:
> Would this be the right way to do it?


It's more complicated because form XObjects, patterns, annotations,
softmasks (and maybe more) can also have fonts. I also doubt that you
can detect CJK fonts this way.
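
To illustrate both points without depending on any PDF library, here is a
rough sketch that models resource dictionaries as plain Java maps. The
"Children" key is a made-up stand-in for the nested resources of form
XObjects, patterns, annotation appearances and soft-mask groups (in PDFBox
these would come from calls such as PDFormXObject.getResources()), and
FontScanSketch and sample() are hypothetical names. It also separates
"composite" from "likely CJK": a /Type0 subtype only says the font is
CID-keyed, while the Ordering of its CIDSystemInfo (Japan1, GB1, CNS1,
Korea1; the list is not exhaustive) is one possible hint about the script.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FontScanSketch {
    // Orderings of Adobe's CJK character collections (assumed subset, not exhaustive).
    static final Set<String> CJK_ORDERINGS = Set.of("Japan1", "GB1", "CNS1", "Korea1");

    static final class FontInfo {
        final String name;
        final String subtype;
        final String ordering; // null when the modelled font has no Ordering entry

        FontInfo(String name, String subtype, String ordering) {
            this.name = name;
            this.subtype = subtype;
            this.ordering = ordering;
        }

        boolean isComposite() {
            return "Type0".equals(subtype);
        }

        boolean looksCjk() {
            return isComposite() && ordering != null && CJK_ORDERINGS.contains(ordering);
        }
    }

    /** Collects every font reachable from the given resources, including nested ones. */
    static List<FontInfo> collectFonts(Map<String, Object> resources) {
        List<FontInfo> found = new ArrayList<>();
        collect(resources, found, Collections.newSetFromMap(new IdentityHashMap<>()));
        return found;
    }

    @SuppressWarnings("unchecked")
    private static void collect(Map<String, Object> res, List<FontInfo> out, Set<Object> seen) {
        if (res == null || !seen.add(res)) {
            return; // nothing to scan, or already visited (shared or cyclic resources)
        }
        Map<String, Map<String, String>> fonts =
                (Map<String, Map<String, String>>) res.getOrDefault("Font", Map.of());
        for (Map.Entry<String, Map<String, String>> e : fonts.entrySet()) {
            Map<String, String> f = e.getValue();
            out.add(new FontInfo(e.getKey(), f.get("Subtype"), f.get("Ordering")));
        }
        List<Map<String, Object>> children =
                (List<Map<String, Object>>) res.getOrDefault("Children", List.of());
        for (Map<String, Object> child : children) {
            collect(child, out, seen);
        }
    }

    /** Page with one simple font, plus a form and a pattern that each carry a Type0 font. */
    static Map<String, Object> sample() {
        Map<String, Object> form = Map.of("Font",
                Map.of("F2", Map.of("Subtype", "Type0", "Ordering", "Japan1")));
        Map<String, Object> pattern = Map.of("Font",
                Map.of("F3", Map.of("Subtype", "Type0", "Ordering", "Identity")));
        return Map.of(
                "Font", Map.of("F1", Map.of("Subtype", "Type1")),
                "Children", List.of(form, pattern));
    }

    public static void main(String[] args) {
        List<FontInfo> fonts = collectFonts(sample());
        long composite = fonts.stream().filter(FontInfo::isComposite).count();
        long cjk = fonts.stream().filter(FontInfo::looksCjk).count();
        // prints "3 fonts, 2 composite, 1 likely CJK"
        System.out.println(fonts.size() + " fonts, " + composite + " composite, " + cjk + " likely CJK");
    }
}
```

In the sample, a scan of the page-level fonts alone would see only F1 and
report no composite font at all, and of the two composite fonts that the
recursive scan finds, only one carries a CJK ordering.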

Re removing the text, see the RemoveAllText example in the source code
download. IIRC this one only does the page content stream.
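
The general idea behind that kind of example, as far as I understand it, is
to parse the content stream into tokens and write back everything except the
text-showing operators and their operands. The sketch below shows only that
filtering step on a plain list of tokens, with no PDF library involved.
TextOperatorFilter, Op and sample() are hypothetical names, and Op is just a
stand-in for whatever operator token type the parser produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TextOperatorFilter {
    /** Marks a token as an operator; any other object is treated as an operand. */
    static final class Op {
        final String name;

        Op(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Text-showing operators from the PDF specification and their operand counts.
    static final Map<String, Integer> TEXT_SHOWING = Map.of("Tj", 1, "TJ", 1, "'", 1, "\"", 3);

    /** Returns a copy of the token list without text-showing operators and their operands. */
    static List<Object> withoutText(List<Object> tokens) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            Integer operands = token instanceof Op ? TEXT_SHOWING.get(((Op) token).name) : null;
            if (operands == null) {
                kept.add(token); // operand or non-text operator: keep as is
                continue;
            }
            // Operands precede their operator, so they sit at the tail of 'kept'.
            for (int i = 0; i < operands && !kept.isEmpty()
                    && !(kept.get(kept.size() - 1) instanceof Op); i++) {
                kept.remove(kept.size() - 1);
            }
        }
        return kept;
    }

    /** Models: BT /F1 12 Tf (Hello) Tj 1 2 (World) " ET */
    static List<Object> sample() {
        return List.of(new Op("BT"), "/F1", 12, new Op("Tf"),
                "(Hello)", new Op("Tj"),
                1, 2, "(World)", new Op("\""),
                new Op("ET"));
    }

    public static void main(String[] args) {
        System.out.println(withoutText(sample())); // prints [BT, /F1, 12, Tf, ET]
    }
}
```

An image-removal variant would need more care than swapping the operator
table, because the Do operator paints both images and form XObjects; each
referenced XObject's subtype would have to be checked before dropping it.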

Tilman

