Posted to user@nutch.apache.org by Yaidel Guedes Beltran <yg...@estudiantes.uci.cu> on 2009/07/06 19:16:27 UTC
Problems when indexing .chm files
Example1:
Error parsing: http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
Example2:
Error parsing: http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
and other similar errors...
Is there any solution?
Re: Problems when indexing .chm files
Posted by Ken Krugler <kk...@transpac.com>.
>Example1:
>
>Error parsing:
>http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm:
>org.apache.nutch.parse.ParseException: parser not found for
>contentType=chemical/x-chemdraw
>url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
>Example2:
>
>Error parsing:
>http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm:
>org.apache.nutch.parse.ParseException: parser not found for
>contentType=chemical/x-chemdraw
>url=http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
>and other similar errors...
>
>Is there any solution?
A .chm file can be any of the following:
- Chameleon source font
- Microsoft compiled HTML help file
- Chemdraw chemical structure
From the URLs in your email, I'm going to guess these are compiled
HTML help files.
1. You should file a Jira issue to improve the Nutch mime-type
detection, as currently it's flagging the .chm files as Chemdraw
chemical structure documents.
2. I think you'd need to create your own Nutch plugin to parse chm
files. Supposedly there are open source tools available for reading
these files - but it's likely these are C#/VB/some other Microsoft
language.
See http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help for details.
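
As a rough illustration of point 1, here is a minimal sketch of a content-based check that could tell a compiled HTML help file apart from a Chemdraw document even when the extension-based guess is wrong. It relies on compiled HTML help files starting with the ASCII signature "ITSF". The class name ChmSniffer and both method names are hypothetical and not part of Nutch, and using "application/vnd.ms-htmlhelp" as the resulting content type is an assumption about what the detection should report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Nutch): content-based check for compiled HTML help. */
public final class ChmSniffer {

    // Compiled HTML help files begin with the four ASCII bytes "ITSF".
    private static final byte[] ITSF = "ITSF".getBytes(StandardCharsets.US_ASCII);

    // Assumed content type to report for compiled HTML help files.
    public static final String CHM_TYPE = "application/vnd.ms-htmlhelp";

    private ChmSniffer() {}

    /** Returns true if the first bytes of the fetched content match the ITSF signature. */
    public static boolean looksLikeChm(byte[] head) {
        if (head == null || head.length < ITSF.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, ITSF.length), ITSF);
    }

    /** Keeps the extension-based guess unless the content itself says "compiled HTML help". */
    public static String refineType(String guessedType, byte[] head) {
        return looksLikeChm(head) ? CHM_TYPE : guessedType;
    }

    public static void main(String[] args) {
        byte[] chmHead = {'I', 'T', 'S', 'F', 3, 0, 0, 0};
        byte[] otherHead = {'<', 'h', 't', 'm', 'l', '>'};
        System.out.println(refineType("chemical/x-chemdraw", chmHead));
        System.out.println(refineType("chemical/x-chemdraw", otherHead));
    }
}
```

A check like this only fixes the reported content type. A parser plugin registered for that type would still be needed before the files could actually be indexed.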
-- Ken