Posted to dev@pdfbox.apache.org by "Gang Luo (JIRA)" <ji...@apache.org> on 2015/12/17 02:02:46 UTC

[jira] [Reopened] (PDFBOX-3166) Unwanted spaces before number in chinese text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Luo reopened PDFBOX-3166:
------------------------------

Text extraction is very sensitive to changes. Yes, I see. Is there an API to control whether the space character appears or not?
I tried PDFTextStripper.setSpacingTolerance(), but increasing the spacing tolerance value does not eliminate the space before the 1.

PDFTextStripper stripper = new PDFTextStripper();
stripper.setSpacingTolerance(800.0f); //0.08f

If I reduce the spacing tolerance value, a space is added after the date number.

The rest is pretty good.
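
If no tolerance setting removes the space, one possible workaround is to post-process the extracted string. The following is only a sketch, not a PDFBox feature: the class name DateSpaceCleanup and the method removeSpacesBeforeDigits are hypothetical, and it assumes the unwanted whitespace always sits directly between a Han character and an ASCII digit.

```java
import java.util.regex.Pattern;

class DateSpaceCleanup {
    // Matches whitespace (ASCII space, no-break space, ideographic space)
    // that is preceded by a Han character and followed by an ASCII digit.
    private static final Pattern SPACE_BEFORE_DIGIT =
            Pattern.compile("(?<=\\p{IsHan})[ \\u00A0\\u3000]+(?=[0-9])");

    // Hypothetical helper: removes only the whitespace matched above and
    // leaves all other spacing in the extracted text untouched.
    static String removeSpacesBeforeDigits(String extracted) {
        return SPACE_BEFORE_DIGIT.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration, not taken from the PDF.
        System.out.println(removeSpacesBeforeDigits("2015年 12月 12日"));
    }
}
```

It would be applied to the string returned by PDFTextStripper.getText(). Spaces between Latin words and numbers are left alone, so this only helps if the Han-then-digit assumption holds for the document.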


> Unwanted spaces before number in chinese text extraction
> --------------------------------------------------------
>
>                 Key: PDFBOX-3166
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3166
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: Gang Luo
>              Labels: test
>         Attachments: 1201830823-marked-1.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Unwanted spaces appear before numbers in Chinese date text,
> such as in this PDF file:
> http://www.cninfo.com.cn/finalpage/2015-12-12/1201830823.PDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org