Posted to dev@tika.apache.org by "Bin Hawking (JIRA)" <ji...@apache.org> on 2014/09/28 10:21:33 UTC
[jira] [Created] (TIKA-1430) CHM parser gets faulty text (fix found)
Bin Hawking created TIKA-1430:
---------------------------------
Summary: CHM parser gets faulty text (fix found)
Key: TIKA-1430
URL: https://issues.apache.org/jira/browse/TIKA-1430
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6, 1.5
Environment: Windows 7; JDK 7 or 8
Reporter: Bin Hawking
Priority: Critical
Tika extracts partially wrong text from CHM files, including the test files tika-parsers/src/test/resources/test-documents/testChm*.chm.
I tried 1.6 and 1.5; both show the same problem. I wonder why no one has complained before.
I checked the source code, and the cause is obvious:
When Tika decompresses the LZX data, the first block is decoded correctly, but for the second and later blocks Tika uses the previous block's content as the compressed data. See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
"""
if (prevBlock != null
        && prevBlock.getState().getBlockLength() > prevBlock
                .getState().getBlockRemaining())
    setChmSection(new ChmSection(prevBlock.getContent()));
    // NOTE: the dataSegment to be decompressed is not kept
else
    setChmSection(new ChmSection(dataSegment));
"""
My fix:
1. Add a prevcontent member variable to the ChmSection class, so that it keeps both dataSegment and prevBlock.getContent().
2. In ChmLzxBlock.extractContent(), when invoking decompressVerbatimBlock(), pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
With this change, the CHM files I tried produce the correct text.
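To illustrate the idea behind the fix, here is a minimal sketch and not the actual Tika code. SectionSketch, Section and decode are hypothetical names, and the token format is a toy stand-in for LZX. It only shows why a section has to keep both the compressed data of the current block and the decompressed content of the previous block: back-references in a later block point into the earlier output.

```java
class SectionSketch {

    /** Hypothetical stand-in for ChmSection: keeps both inputs side by side. */
    static final class Section {
        final byte[] data;        // compressed bytes of the current block
        final String prevContent; // decompressed text of the previous block, or null

        Section(byte[] data, String prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /**
     * Toy decoder (assumed format, NOT real LZX):
     *   0, b          emits the literal byte b
     *   1, dist, len  copies len characters starting dist characters back
     * The window is seeded with the previous block's output, so a copy
     * token may reach back into text produced by the earlier block.
     */
    static String decode(Section s) {
        StringBuilder window =
                new StringBuilder(s.prevContent == null ? "" : s.prevContent);
        int start = window.length(); // where this block's own output begins
        byte[] d = s.data;
        int i = 0;
        while (i < d.length) {
            if (d[i] == 0) {
                window.append((char) d[i + 1]);
                i += 2;
            } else {
                int dist = d[i + 1];
                int len = d[i + 2];
                for (int k = 0; k < len; k++) {
                    window.append(window.charAt(window.length() - dist));
                }
                i += 3;
            }
        }
        return window.substring(start);
    }

    public static void main(String[] args) {
        // Block 1: three literals.
        String first = decode(new Section(new byte[] {0, 'a', 0, 'b', 0, 'c'}, null));
        // Block 2: copy three characters from block 1, then one literal.
        String second = decode(new Section(new byte[] {1, 3, 3, 0, 'd'}, first));
        System.out.println(first);  // abc
        System.out.println(second); // abcd
    }
}
```

In this toy model, the second block decodes correctly only because its own compressed bytes are the input and the previous output is used purely as history. Feeding the previous output in as if it were the compressed data, as the quoted condition does, cannot produce the right text.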
BTW, the unit tests should be tougher: in this case a small amount of text (the first block) is decompressed correctly, so a check on only the beginning of the text would not catch the bug.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)