Posted to user@nutch.apache.org by Jan Riewe <ja...@comspace.de> on 2012/08/08 12:03:11 UTC

CHM Files and Tika

Hey there,

I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:

Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp

I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
should be able to parse those files:
https://issues.apache.org/jira/browse/TIKA-245

In tika-mimetypes.xml I do find an entry for
application/vnd.ms-htmlhelp

Has anyone run into the same issue and knows how to fix it?

Bye
Jan
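
A note on the mime-type entry mentioned above: as far as I understand
Tika, an entry in tika-mimetypes.xml only drives type detection, while
parsers are registered separately, so finding the type there does not by
itself mean a parser is available. A minimal Python sketch for checking
such an entry follows. The XML snippet is a made-up, cut-down stand-in in
the assumed shape of that file (not copied from it), and find_mime_entry
is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for tika-mimetypes.xml (assumed shape, not the real file).
SAMPLE_MIMETYPES = """\
<mime-info>
  <mime-type type="application/vnd.ms-htmlhelp">
    <glob pattern="*.chm"/>
  </mime-type>
  <mime-type type="application/pdf">
    <glob pattern="*.pdf"/>
  </mime-type>
</mime-info>
"""


def find_mime_entry(xml_text, mime_type):
    """Return the glob patterns declared for mime_type, or None if the type is absent."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("mime-type"):
        if entry.get("type") == mime_type:
            return [glob.get("pattern") for glob in entry.findall("glob")]
    return None
```

Calling find_mime_entry(SAMPLE_MIMETYPES, "application/vnd.ms-htmlhelp")
returns the glob patterns of the matching entry; a None result would mean
the type is not declared at all.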

Re: CHM Files and Tika

Posted by Jan Riewe <ja...@comspace.de>.
Hey Sebastian,

As far as I could find out, Tika's CHM parser is far from perfect,
but I would expect the bundled test files to produce correct
results.

There is an alternative library (http://sourceforge.net/projects/chm4j/),
but I don't think there are enough potential users to justify switching
to a different parser for this file type.

Jan

On Tuesday, 14.08.2012, at 22:28 +0200, Sebastian Nagel wrote:
> Hi Jan,
> 
> opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
> Thanks!
> 
> Beyond the "can't retrieve parser" error:
> I've tried a couple of chm files (among them the test files from Tika)
> but I wasn't able to get Tika to extract content.
> 
>  % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
>     tika-parsers/src/test/resources/test-documents/testChm2.chm
> 
> only extracts:
> 
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="10807437"/>
> <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
> <meta name="resourceName" content="testChm2.chm"/>
> <title/>
> </head>
> <body/></html>
> 
> A CHM-viewer shows much more content. What's wrong?
> 
> Sebastian


Re: CHM Files and Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jan,

opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
Thanks!

Beyond the "can't retrieve parser" error:
I've tried a couple of chm files (among them the test files from Tika)
but I wasn't able to get Tika to extract content.

 % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
    tika-parsers/src/test/resources/test-documents/testChm2.chm

only extracts:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>

A CHM-viewer shows much more content. What's wrong?

Sebastian
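
For anyone who wants to check many files for this symptom, here is a
small Python sketch that takes Tika's XHTML output and reports the
metadata plus whatever text is in the body. It uses the output quoted
above as input; summarize is a hypothetical helper name, and this only
inspects the output, it says nothing about why the parser returns an
empty body.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# The tika-app output quoted in the message above.
TIKA_OUTPUT = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>
"""


def summarize(xhtml):
    """Return (metadata dict, stripped body text) for a Tika XHTML document."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xhtml.encode("utf-8"))
    meta = {m.get("name"): m.get("content") for m in root.iter(XHTML_NS + "meta")}
    body = root.find(XHTML_NS + "body")
    text = "" if body is None else "".join(body.itertext()).strip()
    return meta, text
```

An empty string as the second element of the result means no text was
extracted, which is the case for the output above.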

On 08/10/2012 09:32 AM, Julien Nioche wrote:
> new JIRA?


Re: CHM Files and Tika

Posted by Julien Nioche <li...@gmail.com>.
new JIRA?

On 9 August 2012 23:30, Markus Jelsma <ma...@openindex.io> wrote:

> Hmm, I'm not sure, but maybe we don't include all Tika parser
> dependencies in our build.xml?



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: CHM Files and Tika

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, I'm not sure, but maybe we don't include all Tika parser dependencies in our build.xml?

 
 
-----Original message-----
> From:Sebastian Nagel <wa...@googlemail.com>
> Sent: Thu 09-Aug-2012 23:18
> To: user@nutch.apache.org
> Subject: Re: CHM Files and Tika
> 
> Hi Jan,
> 
> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> in the Nutch package.
> 
> Any ideas?
> 
> Sebastian

SolrIndex command

Posted by ma...@Automationdirect.com.
Hi There,
I am a new Nutch user. I am using Nutch to crawl and then send the crawl
data to Solr. I have a question about the bin/nutch solrindex command.
Which Tika libraries are used for indexing: the Tika libraries in Nutch,
or does Nutch let Solr do the indexing, so that Solr's Tika libraries are
used? I think I read somewhere that Nutch focuses on crawling and parsing
and lets Solr do the indexing, so Solr's libraries should get used.

Specifically, I am having problems extracting tags, e.g. <H1>, from
PDF files using the Nutch/Solr combination. The extract-contrib module
defined in schema.xml should get used.

Thanks in advance,
Madhvi



Re: CHM Files and Tika

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jan,

confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
can parse chm. The chm parsers are in tika-parser*.jar which is contained
in the Nutch package.

Any ideas?

Sebastian
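
To double-check what a given jar really ships, a rough Python sketch like
the one below can list the CHM-related classes and the matching lines of
the parser service file. It assumes parsers are registered through
META-INF/services/org.apache.tika.parser.Parser, which is how Tika
normally does it as far as I know; whether Nutch's plugin class loader
picks that file up is a separate question this does not answer.
inspect_parser_jar is a hypothetical helper name, and the jar path you
pass in is up to your installation.

```python
import zipfile

# Service file through which Tika parsers are normally registered (assumption, see above).
SERVICE_FILE = "META-INF/services/org.apache.tika.parser.Parser"


def inspect_parser_jar(jar, keyword="chm"):
    """Return (class entries, registered parser names) whose names contain keyword.

    jar may be a path to a jar file or an open binary file object.
    """
    with zipfile.ZipFile(jar) as archive:
        names = archive.namelist()
        classes = [n for n in names if n.endswith(".class") and keyword in n.lower()]
        registered = []
        if SERVICE_FILE in names:
            for line in archive.read(SERVICE_FILE).decode("utf-8").splitlines():
                line = line.split("#", 1)[0].strip()  # drop comments and blanks
                if line and keyword in line.lower():
                    registered.append(line)
    return classes, registered
```

If the first list is non-empty but the second is empty, the classes are
in the jar but not registered in the service file; if both are non-empty,
the problem is more likely on the side that loads the parsers.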

On 08/08/2012 12:03 PM, Jan Riewe wrote:
> Hey there,
> 
> I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:
> 
> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> 
> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
> should be able to parse those files:
> https://issues.apache.org/jira/browse/TIKA-245
> 
> In tika-mimetypes.xml I do find an entry for
> application/vnd.ms-htmlhelp
> 
> Has anyone run into the same issue and knows how to fix it?
> 
> Bye
> Jan
>