Posted to dev@pdfbox.apache.org by "Jingxuan Yu (JIRA)" <ji...@apache.org> on 2010/07/19 11:19:51 UTC

[jira] Created: (PDFBOX-779) All English characters and some Chinese words are separated by a space

All English characters and some Chinese words are separated by a space
----------------------------------------------------------------------

                 Key: PDFBOX-779
                 URL: https://issues.apache.org/jira/browse/PDFBOX-779
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.1, 1.3.0
         Environment: java 1.6.0_20
pdfbox 1.2.1
fontbox 1.2.1
            Reporter: Jingxuan Yu


See the PDF document and the text extracted from it by ExtractText.
I could not upload attachments while creating the issue, so here is the file's information:
$ pdfinfo IKAnalyzer.pdf 
Title:          IKAnalyzer中文分词器V3.0使用手册
Keywords:       IK Analyzer 中文分词器 Lucene
Author:         林良益、卓诗垚
Creator:        WPS Office 个人版
Producer:       PDFlib 7.0.3 (C++/Win32)
CreationDate:   Sun Dec  6 22:07:26 2009
Tagged:         no
Pages:          15
Encrypted:      no
Page size:      595.3 x 841.9 pts (A4)
File size:      441273 bytes
Optimized:      no
PDF version:    1.5

$ pdffonts IKAnalyzer.pdf 
name                                     type              emb sub uni object ID
---------------------------------------- ----------------- --- --- --- ---------
INUZMH+NSimSun-Identity-H                CID TrueType      yes yes yes      7  0
MGIXAY+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes      8  0
CFLOPA+SimSun-Identity-H                 CID TrueType      yes yes yes      6  0
GHNZKZ+TimesNewRomanPS-BoldMT-Identity-H CID TrueType      yes yes yes     19  0
UNEBHT+Cambria-Bold-Identity-H           CID TrueType      yes yes yes     20  0
UQKWWP+Wingdings-Regular-Identity-H      CID TrueType      yes yes yes     33  0
NKFTTO+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes     40  0
OOJXDG+CourierNewPSMT-Identity-H         CID TrueType      yes yes yes     51  0
WHLDYI+CourierNewPS-ItalicMT-Identity-H  CID TrueType      yes yes yes     58  0
TXIHGB+Cambria-Identity-H                CID TrueType      yes yes yes    100  0
CRJWMD+TimesNewRomanPSMT-Identity-H      CID TrueType      yes yes yes    108  0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-779) All English characters and some Chinese words are separated by a space

Posted by "Jingxuan Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingxuan Yu updated PDFBOX-779:
-------------------------------

    Environment: 
x86_64 GNU/Linux
java 1.6.0_20
pdfbox 1.2.1
fontbox 1.2.1

  was:
java 1.6.0_20
pdfbox 1.2.1
fontbox 1.2.1

    Description: 
See the PDF document and the text extracted from it by ExtractText.
The file's information:
$ pdfinfo IKAnalyzer.pdf 
Title:          IKAnalyzer中文分词器V3.0使用手册
Keywords:       IK Analyzer 中文分词器 Lucene
Author:         林良益、卓诗垚
Creator:        WPS Office 个人版
Producer:       PDFlib 7.0.3 (C++/Win32)
CreationDate:   Sun Dec  6 22:07:26 2009
Tagged:         no
Pages:          15
Encrypted:      no
Page size:      595.3 x 841.9 pts (A4)
File size:      441273 bytes
Optimized:      no
PDF version:    1.5

$ pdffonts IKAnalyzer.pdf 
name                                     type              emb sub uni object ID
---------------------------------------- ----------------- --- --- --- ---------
INUZMH+NSimSun-Identity-H                CID TrueType      yes yes yes      7  0
MGIXAY+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes      8  0
CFLOPA+SimSun-Identity-H                 CID TrueType      yes yes yes      6  0
GHNZKZ+TimesNewRomanPS-BoldMT-Identity-H CID TrueType      yes yes yes     19  0
UNEBHT+Cambria-Bold-Identity-H           CID TrueType      yes yes yes     20  0
UQKWWP+Wingdings-Regular-Identity-H      CID TrueType      yes yes yes     33  0
NKFTTO+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes     40  0
OOJXDG+CourierNewPSMT-Identity-H         CID TrueType      yes yes yes     51  0
WHLDYI+CourierNewPS-ItalicMT-Identity-H  CID TrueType      yes yes yes     58  0
TXIHGB+Cambria-Identity-H                CID TrueType      yes yes yes    100  0
CRJWMD+TimesNewRomanPSMT-Identity-H      CID TrueType      yes yes yes    108  0


[jira] Updated: (PDFBOX-779) All English characters and some Chinese words are separated by a space

Posted by "Jingxuan Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingxuan Yu updated PDFBOX-779:
-------------------------------

    Attachment: IKAnalyzer.pdf
                IKAnalyzer.txt
