Posted to user@nutch.apache.org by Jair Piedrahita Vargas <JA...@bancolombia.com.co> on 2009/07/27 18:50:29 UTC

question

Can Nutch search inside the content of an MS Word file? I've tried, but it reports "parser not found for contentType=application/msword".
What can I do to correct this error?

Thanks

JAIR PIEDRAHITA VARGAS
Gerencia de Investigación y Nuevas Tecnologías
Teléfono: 4040000   Ext 41632
Av. los Industriales Cra 48 # 26-85 piso 6B
BANCOLOMBIA S.A


________________________________
This communication (including all attachments) may contain information that is private, confidential and privileged. If you have received this communication in error; please notify the sender immediately, delete this communication from all data storage devices and destroy all hard copies. Any use, dissemination, distribution, copying or disclosure of this message and any attachments, in whole or in part, by anyone other than the intended recipient(s) is strictly prohibited. This message has been checked with an antivirus software; accordingly, the sender is not liable for the presence of any virus in attachments that causes or may cause damage to the recipient's equipment or software.

Re: question

Posted by reinhard schwab <re...@aon.at>.
I believe it can.
Check your configuration files, nutch-site.xml and nutch-default.xml.

You will find something like

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

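Add "msword" to the list of parsers in that value. Change
parse-(text|html|swf|pdf)
to
parse-(text|html|swf|pdf|msword)

It is best to make this change in nutch-site.xml, whose settings override those in nutch-default.xml.

The plugins folder contains a plugin, parse-msword, that parses MS Word documents.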
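
As a rough sketch of the change described above, an override in nutch-site.xml might look like the following. The value is simply the one quoted above with "msword" added to the parse group; the list in your own nutch-default.xml may differ, so treat the rest of the value as an assumption and copy your actual default instead.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: this file is conf/nutch-site.xml and the value below is
       copied from the plugin.includes default quoted earlier in this reply,
       with only "msword" added to the parse-(...) group. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf|msword)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Since the value is a regular expression over plugin directory names, the parse-msword directory has to be present in the plugins folder for the new alternative to have any effect.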

I have not tried it so far.

Jair Piedrahita Vargas wrote:
> Can Nutch search inside the content of an MS Word file? I've tried, but it reports "parser not found for contentType=application/msword".
> What can I do to correct this error?