Posted to users@pdfbox.apache.org by Dan Liu <13...@qq.com> on 2017/12/20 02:39:44 UTC

1 text line becomes 2 line after extraction

For example, this single line:
肺具有广泛的呼吸面积,成人的总呼吸面积约有100m2(3 亿-7.5 亿肺泡),在呼吸过程中,

becomes two lines after extraction:
2
肺具有广泛的呼吸面积,成人的总呼吸面积约有100m(3 亿-7.5 亿肺泡),在呼吸过程中,

This happens because the y coordinate of the superscript character '2' is smaller than that of the other characters.
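
The effect can be pictured with a small model. This is only a sketch of the general idea, not PDFBox's code: the `Glyph` class, the `toLines` helper and the coordinates are all hypothetical, and y is assumed to be measured from the top of the page, so a raised superscript has a smaller y. Grouping strictly by baseline y puts the '2' on its own line, ahead of the text it belongs to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BaselineSplitSketch {

    /** Hypothetical minimal glyph: its text and the y coordinate of its baseline
     *  (measured from the top of the page, so smaller y means higher up). */
    static final class Glyph {
        final String text;
        final double y;

        Glyph(String text, double y) {
            this.text = text;
            this.y = y;
        }
    }

    /** Groups glyphs by baseline y and returns one string per distinct y, top to bottom. */
    static List<String> toLines(List<Glyph> glyphs) {
        // TreeMap keeps the keys sorted, so lines come out ordered by ascending y.
        Map<Double, StringBuilder> byY = new TreeMap<>();
        for (Glyph g : glyphs) {
            byY.computeIfAbsent(g.y, k -> new StringBuilder()).append(g.text);
        }
        List<String> lines = new ArrayList<>();
        for (StringBuilder sb : byY.values()) {
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("100m", 100.0));
        glyphs.add(new Glyph("2", 95.0));   // superscript: raised, so smaller y (made-up value)
        glyphs.add(new Glyph("(3", 100.0));

        // Prints "2" on one line, then "100m(3" on the next.
        for (String line : toLines(glyphs)) {
            System.out.println(line);
        }
    }
}
```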


------------------


with best regards


daniel

Re: 1 text line becomes 2 line after extraction

Posted by Tilman Hausherr <TH...@t-online.de>.
And it's not just Chinese glyphs. Even in Western texts, things may go 
wrong when the superscript is a few pixels too high:
https://issues.apache.org/jira/browse/PDFBOX-4000




Re: 1 text line becomes 2 line after extraction

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Yes, that happens on page 19. The baseline of the superscript is at or 
above the top of the heuristic glyph height, so PDFBox assumes that 
these are different lines. That heuristic height is never the true 
glyph height; it is usually the height of a small character (e.g. 
a, e, u) in "Western" text, so it may not work well for other scripts. 
See this screenshot:

There is alternative code in
https://issues.apache.org/jira/browse/PDFBOX-3970
which uses actual glyph heights, and it works for page 19 of your file.
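
To make the difference between the two heights concrete, here is a minimal sketch of a vertical-overlap test. It is an illustration under stated assumptions, not the code from PDFBox or from PDFBOX-3970: `sameLine` is a hypothetical helper, all numbers are invented, and y is assumed to grow downward. With a short heuristic height the body text and the superscript do not overlap vertically, so they count as separate lines; with a taller, actual glyph height they do overlap.

```java
final class HeightOverlapSketch {

    /** True if two glyphs' vertical spans overlap. Each span runs from
     *  (baseline - height) down to the baseline; y grows downward. */
    static boolean sameLine(double baselineA, double heightA,
                            double baselineB, double heightB) {
        double topA = baselineA - heightA;
        double topB = baselineB - heightB;
        return topA < baselineB && topB < baselineA;
    }

    public static void main(String[] args) {
        double bodyBaseline = 100.0; // body text (made-up value)
        double supBaseline = 94.0;   // superscript sits 6 units higher (made-up value)
        double supHeight = 3.0;

        // Assumed heuristic height, roughly the height of a small letter.
        double heuristicBodyHeight = 5.0;
        // Assumed actual height of a tall body glyph.
        double actualBodyHeight = 9.0;

        // Heuristic height: body span is [95, 100], superscript baseline is 94.
        System.out.println(sameLine(bodyBaseline, heuristicBodyHeight, supBaseline, supHeight));
        // Actual height: body span is [91, 100], which reaches the superscript.
        System.out.println(sameLine(bodyBaseline, actualBodyHeight, supBaseline, supHeight));
    }
}
```

The first call reports that the glyphs are on different lines and the second that they share a line, which mirrors why a taller height estimate keeps the superscript attached.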

To test that, get LegacyPDFStreamEngine.java and replace the one in the 
build with it (yes, you'll have to build from source). The relevant 
change is the call to getActualGlyphBoundingBox() and the line that 
follows it.

However, build tests have shown problems with other files, so I have 
not taken it any further.

Tilman

result:

1.血液检查 呼吸系统感染时,中性粒细胞增加,有时还伴有中毒颗粒;嗜酸性粒细
胞增加提示过敏性因素、曲霉或寄生虫感染;其他血清学抗体试验,如荧光抗体、对流免疫
电泳、酶联免疫吸附测定等,对于病毒、支原体和细菌感染的诊断均有一定价值。
2.抗原皮肤试验 哮喘的变应原皮肤试验阳性有助于变应体质的确定和相应抗原的脱
敏治疗。对结核或真菌呈阳性的皮肤反应仅说明已受感染,并不能肯定患病。
3.痰液检查 痰涂片在低倍镜视野里上皮细胞<1O 个,白细胞>25 个为相对污染少的
痰标本,定量培养菌量≥1O7cfu/ml 
可判定为致病菌。若经环甲膜穿刺气管吸引、或经纤维
支气管镜(简称纤支镜)防污染双套管毛刷采样,可防止咽喉部寄殖菌的污染,此时培养菌
量≥103cfu/ml 即有诊断意义。反复作痰脱落细胞检查,有助于肺癌的诊断。
4.胸腔积液(胸液)检查和胸膜活检 常规胸液检查可明确渗出性或是漏出性胸液。
检查胸液的溶菌酶、腺苷脱氨酶、癌胚抗原及进行染色体分析,有助于结核性与恶性胸液的
鉴别。脱落细胞和胸膜病理活检对明确肿瘤或结核有诊断价值。
5.影像学检查 胸部 X 线透视配合正侧位胸片,可发现被心、纵隔等掩盖的病变,并
能观察膈、心血管活动情况。高电压体层摄片和 CT 
能进一步明确病变部位、性质以及有关
气管、支气管通畅程度。磁共振显像(MRI)对纵隔疾病和肺血栓栓塞症有较大帮助。肺血
管造影用于肺血栓栓塞症和各种先天性或获得性血管病变的诊断;支气管动脉造影和栓塞术
对咯血有较好的诊治价值。
6.支气管镜和胸腔镜 硬质支气管镜检查已被纤维支气管镜(纤支镜)所替代,前者
仅必要时才用于作气管内肿瘤或异物的摘除手术。纤支镜能深人亚段支气管,直接窥视黏膜
水肿、充血、溃疡、肉芽肿、新生物、异物等,作黏膜的刷检或钳检,进行组织学检查;并
可经纤支镜作支气管肺泡灌洗。灌洗液的微生物、细胞学、免疫学、生物化学等检查,有助
于明确病原和病理诊断;还可通过它取出异物、诊断咯血,经高频电刀、激光、微波及药物
注射治疗良、恶性肿瘤。借助纤支镜的引导还可作气管插管。胸腔镜已广泛应用于胸膜活检、
肺活检。
7.放射性核素扫描 应用 133氙或 
99m锝-二乙三胺五乙酸(99mTc-DTPA)雾化吸人。99m锝
大颗粒人血清聚合清蛋白(99mTc-MAA)静脉注射对肺区域性通气/灌注情况、肺血栓栓塞症
和血流缺损,以及占位病变的诊断有帮助。67镓对间质性肺纤维化的肺泡炎、结节病和肺癌


On 21.12.2017 at 02:26, Dan Liu wrote:
> Hi, all:
>      I'm using PDFBox 2.0.8; the test PDF file can be downloaded from http://proj.gz-yibo.com:2880/nk7.pdf
>
> For example, this text line on page 19:
> 7.放射性核素扫描应用133 氙或99m 锝-二乙三胺五乙酸(99mTc-DTPA)雾化吸人。99m 锝
> becomes:
> 133 99m 99m 99m
> 7.放射性核素扫描应用 氙或 锝-二乙三胺五乙酸(Tc-DTPA)雾化吸人。 锝
>
>
> ------------------
>    With best regards
>
>
> Daniel
>
>



Re: 1 text line becomes 2 line after extraction

Posted by Dan Liu <13...@qq.com>.
Hi, all:
    I'm using PDFBox 2.0.8; the test PDF file can be downloaded from http://proj.gz-yibo.com:2880/nk7.pdf

For example, this text line on page 19:
7.放射性核素扫描应用133 氙或99m 锝-二乙三胺五乙酸(99mTc-DTPA)雾化吸人。99m 锝
becomes:
133 99m 99m 99m
7.放射性核素扫描应用 氙或 锝-二乙三胺五乙酸(Tc-DTPA)雾化吸人。 锝


------------------
  With best regards


Daniel

