Posted to dev@pdfbox.apache.org by Bernard Segonnes <bs...@free.fr> on 2010/08/09 15:41:49 UTC
Text extraction : do we need those files ?
Hi,
I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
too big and too slow (probably due to memory constraints): around 5 MB (9 MB
once installed on a mobile device, which is too much).
So I'm looking for files I can delete. I only need to extract text.
Are the files in:
1) cmap required? (78-EUC_H Adobe-CNS-5 GBK-EUC-V UniKS-UTF8-H
...) I would be pleased to remove all those files :-)
2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml
pdf_zh_Hant.xml ...)
3) Are there other resource files I can remove?
Thanks for the help.
Re: Text extraction : do we need those files ?
Posted by Bernard Segonnes <bs...@free.fr>.
Thanks for the answer.
PDFBOX-586 was filed by me :-)
Since I expect to have customers in Asian and right-to-left-language countries, I
will keep those files :-(
(I sometimes get out-of-memory errors, which I should catch since my app runs on
mobile devices/phones.) I will optimize elsewhere.
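For the out-of-memory case mentioned above, here is a minimal sketch of guarding an extraction call. The class and method names (GuardedExtraction, extractOrFallback) are made up for illustration and are not part of PDFBox; the Callable stands in for whatever code actually runs the text extraction. Note that in Java this condition is an OutOfMemoryError (an Error, not an Exception), and catching it is only a best-effort measure: the heap may still be nearly full afterwards.

```java
import java.util.concurrent.Callable;

public class GuardedExtraction {

    /**
     * Runs the given extraction task and returns its result, or the
     * fallback value if the JVM runs out of heap while the task executes.
     * Hypothetical helper for illustration, not a PDFBox API.
     */
    static String extractOrFallback(Callable<String> extraction, String fallback) {
        try {
            return extraction.call();
        } catch (OutOfMemoryError e) {
            // Best effort only: the caller should drop references to the
            // document so the garbage collector can reclaim the memory.
            return fallback;
        } catch (Exception e) {
            // Other failures (I/O, parsing) are not hidden, just rethrown unchecked.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated failure; a real task would call the text extraction code here.
        String text = extractOrFallback(new Callable<String>() {
            public String call() {
                throw new OutOfMemoryError("simulated");
            }
        }, "");
        System.out.println(text.isEmpty() ? "extraction failed, fallback used" : text);
    }
}
```

Returning a fallback keeps the app alive so it can report that the document was too large, instead of crashing.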
Quoting Jukka Zitting <ju...@gmail.com>:
> Hi,
>
> On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> > I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> > too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> > once installed on a mobile device, which is too much).
>
> See PDFBOX-586 [1] for some related progress.
>
> > Are the files in:
> > 1) cmap required? (78-EUC_H Adobe-CNS-5 GBK-EUC-V UniKS-UTF8-H
> > ...) I would be pleased to remove all those files :-)
>
> These are only needed for processing PDF documents that use CJK
> (Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
> from the internal font-specific character identification codes to
> Unicode.
>
> > 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml
> > pdf_zh_Hant.xml ...)
>
> These are part of the ICU4J library. You only need ICU4J for handling
> Arabic and other right-to-left languages.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-586
>
> BR,
>
> Jukka Zitting
>
Bernard SEGONNES
-------------------------------------
bsegonnes@free.fr
http://bsegonnes.free.fr
Re: Text extraction : do we need those files ?
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> once installed on a mobile device, which is too much).
See PDFBOX-586 [1] for some related progress.
> Are the files in:
> 1) cmap required? (78-EUC_H Adobe-CNS-5 GBK-EUC-V UniKS-UTF8-H
> ...) I would be pleased to remove all those files :-)
These are only needed for processing PDF documents that use CJK
(Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
from the internal font-specific character identification codes to
Unicode.
> 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml
> pdf_zh_Hant.xml ...)
These are part of the ICU4J library. You only need ICU4J for handling
Arabic and other right-to-left languages.
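Before deciding what to drop, it may help to measure how much each resource directory actually contributes to the JAR. The following is a rough sketch using only java.util.zip; the class name JarSizeAudit and its method are made up for illustration, and the directory names it reports depend on how your JAR is laid out.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarSizeAudit {

    /** Sums the compressed size of all entries, grouped by parent directory. */
    static Map<String, Long> compressedBytesByDirectory(File jar) throws IOException {
        Map<String, Long> totals = new TreeMap<String, Long>();
        ZipFile zip = new ZipFile(jar);
        try {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                int slash = name.lastIndexOf('/');
                String dir = slash < 0 ? "(root)" : name.substring(0, slash);
                // getCompressedSize() returns -1 when the size is unknown.
                long size = Math.max(entry.getCompressedSize(), 0L);
                Long previous = totals.get(dir);
                totals.put(dir, previous == null ? size : previous + size);
            }
        } finally {
            zip.close();
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java JarSizeAudit <path-to-jar>");
            return;
        }
        Map<String, Long> totals = compressedBytesByDirectory(new File(args[0]));
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Running it against the PDFBox and ICU4J JARs shows whether the CMap and locale data are really where the bytes go, or whether the class files dominate.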
[1] https://issues.apache.org/jira/browse/PDFBOX-586
BR,
Jukka Zitting