Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/12/07 14:43:00 UTC
[jira] [Issue Comment Deleted] (TIKA-2519) Issue parsing multiple CHM files concurrently
[ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2519:
------------------------------
Comment: was deleted
(was: I'm seeing this when I run the code against CHM files from multiple threads:
{noformat}
Caused by: org.apache.tika.exception.TikaException: can't copy beyond array length
at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:347)
at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.enumerateChmDirectoryListingList(ChmDirectoryListingSet.java:144)
at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.<init>(ChmDirectoryListingSet.java:63)
at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:181)
at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:63)
{noformat}
This is a problem.)
> Issue parsing multiple CHM files concurrently
> ---------------------------------------------
>
> Key: TIKA-2519
> URL: https://issues.apache.org/jira/browse/TIKA-2519
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Eamonn Saunders
> Priority: Blocker
>
> Should I expect to be able to parse multiple CHM files concurrently in multiple threads?
> What I'm noticing when attempting to parse 2 different CHM files in different threads is that:
> - ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
> {code}
> ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
>         directoryListingEntry,
>         (int) getChmLzxcResetTable().getBlockLen(),
>         getChmLzxcControlData());
> {code}
> - ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to limit the number of ChmBlockInfo instances to 1.
> {code}
> public static ChmBlockInfo getChmBlockInfoInstance(
>         DirectoryListingEntry dle, int bytesPerBlock,
>         ChmLzxcControlData clcd) {
>     setChmBlockInfo(new ChmBlockInfo());
>     getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
>     getChmBlockInfo().setEndBlock(
>             (dle.getOffset() + dle.getLength()) / bytesPerBlock);
>     getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
>     getChmBlockInfo().setEndOffset(
>             (dle.getOffset() + dle.getLength()) % bytesPerBlock);
>     // potential problem with casting long to int
>     getChmBlockInfo().setIniBlock(
>             getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
>                     % (int) clcd.getResetInterval());
>     // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
>     // % (int) clcd.getResetInterval());
>     return getChmBlockInfo();
> }
> {code}
> Is there a good reason why there should only ever be one instance of ChmBlockInfo?
> Should we forget about attempting to process CHM files in parallel and instead queue them up to be processed sequentially?
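
To illustrate the question above: if getChmBlockInfo() and setChmBlockInfo() read and write a single static field (an assumption based on the quoted snippet), two threads calling the factory at once could overwrite each other's values mid-computation. The following is a minimal sketch of an alternative that builds each instance from local variables only. BlockInfoSketch and ConcurrentDemo are hypothetical names used for illustration, not Tika classes, and the int parameters stand in for the values taken from DirectoryListingEntry and ChmLzxcControlData. This is not necessarily how the issue was or should be resolved in Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for ChmBlockInfo; immutable, no static state.
final class BlockInfoSketch {
    final int startBlock;
    final int endBlock;
    final int startOffset;
    final int endOffset;
    final int iniBlock;

    private BlockInfoSketch(int startBlock, int endBlock,
                            int startOffset, int endOffset, int iniBlock) {
        this.startBlock = startBlock;
        this.endBlock = endBlock;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.iniBlock = iniBlock;
    }

    // Same arithmetic as the quoted factory method, but every intermediate
    // value lives in a local variable, so concurrent callers cannot interfere.
    static BlockInfoSketch of(int offset, int length,
                              int bytesPerBlock, int resetInterval) {
        int startBlock = offset / bytesPerBlock;
        int endBlock = (offset + length) / bytesPerBlock;
        int startOffset = offset % bytesPerBlock;
        int endOffset = (offset + length) % bytesPerBlock;
        // Round startBlock down to a multiple of resetInterval.
        int iniBlock = startBlock - startBlock % resetInterval;
        return new BlockInfoSketch(startBlock, endBlock,
                startOffset, endOffset, iniBlock);
    }
}

final class ConcurrentDemo {
    // Calls the factory from several threads and counts results whose
    // startBlock does not match the inputs that the task passed in.
    static int countMismatches(int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int offset = i * 1000;
                results.add(pool.submit(() -> {
                    BlockInfoSketch info = BlockInfoSketch.of(offset, 10, 64, 2);
                    return info.startBlock == offset / 64;
                }));
            }
            int mismatches = 0;
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    mismatches++;
                }
            }
            return mismatches;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("mismatches: " + countMismatches(1000));
    }
}
```

Because each call returns its own object and touches no shared field, the factory itself needs no locking. Whether other parts of the CHM parser hold shared state is a separate question that this sketch does not address.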
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)