Posted to java-user@lucene.apache.org by "Zhang, Lisheng" <Li...@broadvision.com> on 2006/02/10 19:45:57 UTC
Can PDFBox or POI handle multi-byte characters with different encodings?
Hi,
Currently we use PDFBox to process PDF files and POI to process
DOC/XLS files before sending the extracted strings to Lucene
for indexing.
Does anyone know whether PDFBox or POI can process multi-byte
characters, such as Japanese, in whatever encoding the PDF or
DOC file specifies?
Thanks very much for any help, Lisheng
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Can PDFBox or POI handle multi-byte characters with different encodings?
Posted by Ben Litchfield <be...@csh.rit.edu>.
PDFBox can handle multi-byte encodings. A couple of recent fixes for
CJK languages are not part of 0.7.2 but are included in the nightly
build.
Ben
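As a side note, here is a minimal sketch of why the source encoding stops mattering once text has been decoded into a Java String, which is what gets handed to Lucene. It uses only the JDK and does not call PDFBox or POI; the class name EncodingSketch and the helper decode are hypothetical, and it assumes the extractor has already identified the declared charset correctly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: decode raw bytes using the charset the source declares.
    static String decode(byte[] raw, Charset declared) {
        return new String(raw, declared);
    }

    public static void main(String[] args) {
        // Three kanji characters, written as escapes so the source file stays ASCII.
        String original = "\u65E5\u672C\u8A9E";

        // The same text stored in two different multi-byte encodings.
        byte[] asUtf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = original.getBytes(StandardCharsets.UTF_16BE);

        // Decoding each with its own charset yields identical Java Strings.
        String fromUtf8 = decode(asUtf8, StandardCharsets.UTF_8);
        String fromUtf16 = decode(asUtf16, StandardCharsets.UTF_16BE);

        System.out.println(asUtf8.length + " bytes vs " + asUtf16.length + " bytes"); // 9 bytes vs 6 bytes
        System.out.println(fromUtf8.equals(fromUtf16)); // true
    }
}
```

Other charsets such as Shift_JIS or EUC-JP can be looked up with Charset.forName, provided the Java runtime in use ships them. Decoding with the wrong charset is the usual source of garbled CJK text, so the declared encoding has to be respected at extraction time.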