Posted to dev@pdfbox.apache.org by Bernard Segonnes <bs...@free.fr> on 2010/08/09 15:41:49 UTC

Text extraction : do we need those files ?

Hi,

I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
too big and too slow (probably due to memory constraints): around 5 MB (9 MB
once installed on a mobile device, which is too much).

So I'm looking for files I can delete. I only need to extract text.

Are the files in:
1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
I would be pleased to remove all those files :-)


2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
pdf_zh_Hant.xml, ...)


3) Are there other resource files I can remove?

Thanks for the help.

Re: Text extraction : do we need those files ?

Posted by Bernard Segonnes <bs...@free.fr>.

Thanks for the answer.

PDFBOX-586 was filed by me :-)

So, as I expect to have customers in Asian and right-to-left countries, I
will keep those files :-(

(I sometimes get out-of-memory errors, which I should catch since my app runs
on mobile devices/phones.) I will optimize elsewhere.
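
Catching the out-of-memory condition could look roughly like the sketch below.
It is a minimal illustration only: SafeExtract, extractOrFallback and the
Supplier argument are hypothetical names, not part of PDFBox, and the actual
extraction call is left to the caller. In Java the condition surfaces as
java.lang.OutOfMemoryError, which is an Error rather than an Exception, so it
has to be caught explicitly.

```java
import java.util.function.Supplier;

/** Hypothetical helper; not part of PDFBox. */
final class SafeExtract {
    /** Runs the extraction and returns the fallback if the JVM runs out of memory. */
    static String extractOrFallback(Supplier<String> extraction, String fallback) {
        try {
            return extraction.get();
        } catch (OutOfMemoryError e) {
            // Any partial result is discarded; the caller gets the fallback instead.
            return fallback;
        }
    }
}
```

Recovery after an OutOfMemoryError is not guaranteed, so this is best treated
as a way to fail gracefully rather than as a fix for the memory usage itself.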

Quoting Jukka Zitting <ju...@gmail.com>:

> Hi,
>
> On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> > I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> > too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> > once installed on a mobile device, which is too much).
>
> See PDFBOX-586 [1] for some related progress.
>
> > Are the files in:
> > 1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
> > I would be pleased to remove all those files :-)
>
> These are only needed for processing PDF documents that use CJK
> (Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
> from the internal font-specific character identification codes to
> Unicode.
>
> > 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
> > pdf_zh_Hant.xml, ...)
>
> These are part of the ICU4J library. You only need ICU4J for handling
> Arabic and other right-to-left languages.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-586
>
> BR,
>
> Jukka Zitting
>


Bernard SEGONNES
-------------------------------------
bsegonnes@free.fr
http://bsegonnes.free.fr

Re: Text extraction : do we need those files ?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> once installed on a mobile device, which is too much).

See PDFBOX-586 [1] for some related progress.

> Are the files in:
> 1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
> I would be pleased to remove all those files :-)

These are only needed for processing PDF documents that use CJK
(Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
from the internal font-specific character identification codes to
Unicode.
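
To make the role of a CMap concrete, here is a toy sketch of that kind of
lookup. It is not PDFBox code: ToyCMap, its methods, and the code values in
the usage note are invented for illustration, and real CMap files also
support code ranges and variable-length codes, which this ignores.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of a CMap-style lookup; not PDFBox code. */
final class ToyCMap {
    // Hypothetical table: 2-byte font-specific character code -> Unicode string.
    private final Map<Integer, String> codeToUnicode = new HashMap<>();

    void addMapping(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    /** Decodes big-endian 2-byte codes; unmapped codes become the replacement character. */
    String decode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            out.append(codeToUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }
}
```

For example, after addMapping(0x0001, "\u65E5"), decoding the bytes {0, 1}
yields that character, while any code without a mapping comes back as U+FFFD.
Without the table the extractor has no way to turn such codes into readable
text, which is why removing the cmap files breaks extraction for documents
that rely on them.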

> 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
> pdf_zh_Hant.xml, ...)

These are part of the ICU4J library. You only need ICU4J for handling
Arabic and other right-to-left languages.
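
As an illustration of what "right-to-left" means at the character level, the
sketch below checks whether a string contains any characters that need
bidirectional handling. Note that it uses the JDK's own java.text.Bidi class
rather than ICU4J, and it does not show how PDFBox itself uses ICU4J; RtlCheck
and needsBidi are hypothetical names.

```java
import java.text.Bidi;

/** Hypothetical helper using the JDK's java.text.Bidi, not ICU4J. */
final class RtlCheck {
    /** Returns true if the text contains characters that need bidirectional handling. */
    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }
}
```

Plain Latin text such as "Hello" returns false, while Hebrew or Arabic text
returns true.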

[1] https://issues.apache.org/jira/browse/PDFBOX-586

BR,

Jukka Zitting