Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/12/14 20:11:01 UTC
[jira] Updated: (PDFBOX-779) All English characters and some Chinese words are separated by a space
[ https://issues.apache.org/jira/browse/PDFBOX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-779:
--------------------------------------
Attachment: PDFBOX779-IKAnalyzer.txt
> All English characters and some Chinese words are separated by a space
> ----------------------------------------------------------------------
>
> Key: PDFBOX-779
> URL: https://issues.apache.org/jira/browse/PDFBOX-779
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1, 1.3.1
> Environment: x86_64 GNU/Linux
> java 1.6.0_20
> pdfbox 1.2.1
> fontbox 1.2.1
> Reporter: Jingxuan Yu
> Fix For: 1.4.0
>
> Attachments: IKAnalyzer.pdf, IKAnalyzer.txt, PDFBOX779-IKAnalyzer.txt
>
>
> See the attached PDF document and the text document extracted from it by ExtractText.
> The file's info:
> $ pdfinfo IKAnalyzer.pdf
> Title: IKAnalyzer中文分词器V3.0使用手册
> Keywords: IK Analyzer 中文分词器 Lucene
> Author: 林良益、卓诗垚
> Creator: WPS Office 个人版
> Producer: PDFlib 7.0.3 (C++/Win32)
> CreationDate: Sun Dec 6 22:07:26 2009
> Tagged: no
> Pages: 15
> Encrypted: no
> Page size: 595.3 x 841.9 pts (A4)
> File size: 441273 bytes
> Optimized: no
> PDF version: 1.5
> $ pdffonts IKAnalyzer.pdf
> name                                     type         emb sub uni object ID
> ---------------------------------------- ------------ --- --- --- ---------
> INUZMH+NSimSun-Identity-H                CID TrueType yes yes yes   7 0
> MGIXAY+MicrosoftYaHei-Identity-H         CID TrueType yes yes yes   8 0
> CFLOPA+SimSun-Identity-H                 CID TrueType yes yes yes   6 0
> GHNZKZ+TimesNewRomanPS-BoldMT-Identity-H CID TrueType yes yes yes  19 0
> UNEBHT+Cambria-Bold-Identity-H           CID TrueType yes yes yes  20 0
> UQKWWP+Wingdings-Regular-Identity-H      CID TrueType yes yes yes  33 0
> NKFTTO+MicrosoftYaHei-Identity-H         CID TrueType yes yes yes  40 0
> OOJXDG+CourierNewPSMT-Identity-H         CID TrueType yes yes yes  51 0
> WHLDYI+CourierNewPS-ItalicMT-Identity-H  CID TrueType yes yes yes  58 0
> TXIHGB+Cambria-Identity-H                CID TrueType yes yes yes 100 0
> CRJWMD+TimesNewRomanPSMT-Identity-H      CID TrueType yes yes yes 108 0
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.