Posted to dev@tika.apache.org by "Oleg Tikhonov (JIRA)" <ji...@apache.org> on 2010/12/06 07:54:12 UTC

[jira] Issue Comment Edited: (TIKA-245) Support of CHM Format

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966721#action_12966721 ] 

Oleg Tikhonov edited comment on TIKA-245 at 12/6/10 1:52 AM:
-------------------------------------------------------------

A couple of weeks ago I received this answer from SourceForge.net:
"My apologies for not passing this message on sooner, however the project admin has responded that he is not willing to give up this project at this time. As such, we are not fulfilling this takeover request."

The library as it stands today contains critical bugs, and because the project is abandoned I cannot get them fixed. I would therefore exclude it as an option.

Another option is 7-Zip-JBinding (http://sourceforge.net/projects/sevenzipjbind/develop/). I've implemented a CHM parser using this library, and it works pretty well: the throughput of HTML extraction is about 5 MB/sec. However, it is licensed under the LGPL. I've asked Boris Brodski (the developer of that library) whether he could re-license it for us. Here is a link to the discussion between him and Igor Pavlov (the author of 7-Zip):
http://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/3983892

What do you think?

BR,
Oleg 

  
> Support of CHM Format
> ---------------------
>
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Priority: Minor
>         Attachments: TIKA-245.tikhonov.20103107.patch.txt
>
>
> It might be a good idea to support the Windows CHM file format. Some information is available at http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format contains HTML files, which can be parsed by Tika, so the "only" problem is extracting the data from the CHM file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.