Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2011/06/05 00:32:47 UTC

[jira] [Commented] (TIKA-245) Support of CHM Format

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044403#comment-13044403 ] 

Chris A. Mattmann commented on TIKA-245:
----------------------------------------

Hi Oleg,

Looking over this patch, I have a few recommendations:

# the patch should be applied to the Tika source tree format (e.g., tika-parsers/src/main/java/org/apache/tika/parsers/chm)
# Many of the class-top-level comments can probably be removed and thrown up on the Tika Wiki
# it would be nice to include at least a unit test or 2 to know this is working. It's a huge patch, and I don't have a lot of CHM files to test it out on (being a Mac guy :-) )

Cheers,
Chris



> Support of CHM Format
> ---------------------
>
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Priority: Minor
>         Attachments: TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
>
>
> It might be a good idea to support the CHM file format of Windows. Some information about it: http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format contains HTML files which can be parsed by Tika. So the "only" problem is to extract the data from the CHM file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Jukka,

Thanks for the motivation. I put my money where my mouth was :-)

Oleg, your patch rox. That's all I had to say. My improvement was simply to commit it to the Tika sources. Feel free to mod/add/whatever on it after that, per Jukka's comments.

I am going to make one more update, just to add the parser to the SPI provider file for Parsers.

Thanks guys!

Cheers,
Chris

On Jun 7, 2011, at 7:27 AM, Mattmann, Chris A (388J) wrote:

> Hey Jukka,
> 
> On Jun 7, 2011, at 6:55 AM, Jukka Zitting wrote:
> 
>> Hi,
>> 
>> On Tue, Jun 7, 2011 at 3:52 PM, Mattmann, Chris A (388J)
>> <ch...@jpl.nasa.gov> wrote:
>>> Please revert r1132997, and then just modify your patch to make sure that
>>> your java classes and files fit into the appropriate Tika source code area.
>>> Then please attach a new patch real quick so I (or some other committer)
>>> can verify and then you're good to go.
>> 
>> Oleg's a committer like anyone else, so IMHO there's no need for an
>> extra round of review. CTR and all that.
>> 
>> If you have ideas on how to improve on Oleg's commit, you can post a
>> patch or commit the improvements directly.
> 
> I thought about doing that after sending the email -- that said, I was trying to give him a shot to do it. 
> 
> If he doesn't, or if no one else does soon, I'll get to it this weekend.
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 




Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Jukka,

On Jun 7, 2011, at 6:55 AM, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Jun 7, 2011 at 3:52 PM, Mattmann, Chris A (388J)
> <ch...@jpl.nasa.gov> wrote:
>> Please revert r1132997, and then just modify your patch to make sure that
>> your java classes and files fit into the appropriate Tika source code area.
>> Then please attach a new patch real quick so I (or some other committer)
>> can verify and then you're good to go.
> 
> Oleg's a committer like anyone else, so IMHO there's no need for an
> extra round of review. CTR and all that.
> 
> If you have ideas on how to improve on Oleg's commit, you can post a
> patch or commit the improvements directly.

I thought about doing that after sending the email -- that said, I was trying to give him a shot to do it. 

If he doesn't, or if no one else does soon, I'll get to it this weekend.

Cheers,
Chris



Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Jun 7, 2011 at 3:52 PM, Mattmann, Chris A (388J)
<ch...@jpl.nasa.gov> wrote:
> Please revert r1132997, and then just modify your patch to make sure that
> your java classes and files fit into the appropriate Tika source code area.
> Then please attach a new patch real quick so I (or some other committer)
> can verify and then you're good to go.

Oleg's a committer like anyone else, so IMHO there's no need for an
extra round of review. CTR and all that.

If you have ideas on how to improve on Oleg's commit, you can post a
patch or commit the improvements directly.

BR,

Jukka Zitting

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Oleg,

On Jun 7, 2011, at 6:28 AM, Oleg Tikhonov wrote:

> Hi Chris,
> 
> I've applied the patch to
> tika-parsers/src/main/java/org/apache/tika/parser/chm, and also added the
> tests, along with 3 CHM files in tika-parsers/src/test/resources/test-documents.

Thanks, and sorry: I think I confused you with my comments on the JIRA issue.

Please uncommit the patch from Tika. By:

>> the patch should be applied to the Tika source tree format (e.g.,
>> tika-parsers/src/main/java/org/apache/tika/parsers/chm)

I didn't mean literally to "commit the patch to SVN" :-) I meant that, if you look inside your patch, it doesn't put the Java class files in the appropriate Tika source code area (e.g., tika-parsers/src/main/java/org/apache/tika/parsers/chm).

Please revert r1132997, and then just modify your patch to make sure that your java classes and files fit into the appropriate Tika source code area. Then please attach a new patch real quick so I (or some other committer) can verify and then you're good to go. 

Cheers,
Chris


> 
> BR,
> Oleg
> 
> On Sun, Jun 5, 2011 at 1:32 AM, Chris A. Mattmann (JIRA) <ji...@apache.org> wrote:
> 
>> 
>>   [
>> https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044403#comment-13044403]
>> 
>> Chris A. Mattmann commented on TIKA-245:
>> ----------------------------------------
>> 
>> Hi Oleg,
>> 
>> Looking over this patch, I have a few recommendations:
>> 
>> # the patch should be applied to the Tika source tree format (e.g.,
>> tika-parsers/src/main/java/org/apache/tika/parsers/chm)
>> # Many of the class-top-level comments can probably be removed and thrown
>> up on the Tika Wiki
>> # it would be nice to include at least a unit test or 2 to know this is
>> working. It's a huge patch, and I don't have a lot of CHM files to test it
>> out on (being a Mac guy :-) )
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>>> Support of CHM Format
>>> ---------------------
>>> 
>>>                Key: TIKA-245
>>>                URL: https://issues.apache.org/jira/browse/TIKA-245
>>>            Project: Tika
>>>         Issue Type: New Feature
>>>         Components: parser
>>>        Environment: All
>>>           Reporter: Karl Heinz Marbaise
>>>           Priority: Minor
>>>        Attachments: TIKA-245.tikhonov.04082011.patch.txt,
>> TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt,
>> TIKA-245.tikhonov.20112703.txt
>>> 
>>> 
>>> It might be a good idea to support the CHM File format of Windows. Some
>> information about
>> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
>> The CHM format contains HTML files which can be parsed by Tika. So the
>> "only" problem is to extract the data from the CHM file.
>> 
>> 




Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Chris,

I've applied the patch to
tika-parsers/src/main/java/org/apache/tika/parser/chm, and also added the
tests, along with 3 CHM files in tika-parsers/src/test/resources/test-documents.

BR,
Oleg

On Sun, Jun 5, 2011 at 1:32 AM, Chris A. Mattmann (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044403#comment-13044403]
>
> Chris A. Mattmann commented on TIKA-245:
> ----------------------------------------
>
> Hi Oleg,
>
> Looking over this patch, I have a few recommendations:
>
> # the patch should be applied to the Tika source tree format (e.g.,
> tika-parsers/src/main/java/org/apache/tika/parsers/chm)
> # Many of the class-top-level comments can probably be removed and thrown
> up on the Tika Wiki
> # it would be nice to include at least a unit test or 2 to know this is
> working. It's a huge patch, and I don't have a lot of CHM files to test it
> out on (being a Mac guy :-) )
>
> Cheers,
> Chris
>
>
>
> > Support of CHM Format
> > ---------------------
> >
> >                 Key: TIKA-245
> >                 URL: https://issues.apache.org/jira/browse/TIKA-245
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >         Environment: All
> >            Reporter: Karl Heinz Marbaise
> >            Priority: Minor
> >         Attachments: TIKA-245.tikhonov.04082011.patch.txt,
> TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt,
> TIKA-245.tikhonov.20112703.txt
> >
> >
> > It might be a good idea to support the CHM File format of Windows. Some
> information about
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
> The CHM format contains HTML files which can be parsed by Tika. So the
> "only" problem is to extract the data from the CHM file.
>
>