Posted to user@nutch.apache.org by Yaidel Guedes Beltran <yg...@estudiantes.uci.cu> on 2009/07/06 19:16:27 UTC

Problems when indexing .chm files

Example 1:

Error parsing: http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

Example 2:

Error parsing: http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

and other similar errors...

Is there any solution?

 

Re: Problems when indexing .chm files

Posted by Ken Krugler <kk...@transpac.com>.

A .chm file can be any of the following:

- Chameleon source font
- Microsoft compiled HTML help file
- Chemdraw chemical structure

From the URLs in your email, I'm going to guess these are compiled 
HTML help files.

1. You should file a Jira issue to improve the Nutch mime-type 
detection, as currently it's flagging the .chm files as Chemdraw 
chemical structure documents.

2. I think you'd need to create your own Nutch plugin to parse chm 
files. Supposedly there are open source tools available for reading 
these files, but they are likely written in C#, VB, or some other 
Microsoft language.

See http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help for details.
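
As a rough illustration of point 1, content sniffing can tell a compiled 
HTML help file apart from the other .chm formats without relying on the 
file extension. The sketch below is not Nutch code and does not use any 
Nutch API; the class and method names (ChmSniffer, hasChmSignature, 
fileHasChmSignature) are made up for this example. It only assumes that 
compiled HTML help files start with the four ASCII bytes "ITSF", and it 
needs Java 9 or later for InputStream.readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: checks for the "ITSF" signature
// found at the start of Microsoft compiled HTML help files.
final class ChmSniffer {

    // Leading bytes of a compiled HTML help file.
    private static final byte[] CHM_MAGIC =
            "ITSF".getBytes(StandardCharsets.US_ASCII);

    private ChmSniffer() {
    }

    // Returns true if the given leading bytes begin with the CHM signature.
    static boolean hasChmSignature(byte[] head) {
        if (head == null || head.length < CHM_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < CHM_MAGIC.length; i++) {
            if (head[i] != CHM_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads only the first four bytes of the file and checks the signature.
    static boolean fileHasChmSignature(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[CHM_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length);
            return read == head.length && hasChmSignature(head);
        }
    }
}
```

A file that passes this check could be reported with the registered type 
application/vnd.ms-htmlhelp rather than chemical/x-chemdraw. Parsing the 
contents would still need a separate parser plugin, as in point 2.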

-- Ken
-- 
Ken Krugler
<http://ken-blog.krugler.org>
+1 530-265-2225