Posted to user@nutch.apache.org by Jair Piedrahita Vargas <JA...@bancolombia.com.co> on 2009/09/02 00:51:39 UTC

written accent

Hi everyone!

I want to search an intranet whose pages are in Spanish, but I am having problems searching for pages that contain words with written accents. Apparently Nutch is not indexing those pages. What can I do to solve this problem?

Thanks

Regards,

Jair Piedrahíta Vargas

________________________________
El contenido de este mensaje puede ser información privilegiada y confidencial. Si usted no es el destinatario real del mismo, por favor informe de ello a quien lo envía y destrúyalo en forma inmediata. Está prohibida su retención, grabación, utilización o divulgación con cualquier propósito. Este mensaje ha sido verificado con software antivirus; en consecuencia, el remitente de éste no se hace responsable por la presencia en él o en sus anexos de algún virus que pueda generar daños en los equipos o programas del destinatario.
******************************************************************************************************
This communication (including all attachments) may contain information that is private, confidential and privileged. If you have received this communication in error; please notify the sender immediately, delete this communication from all data storage devices and destroy all hard copies. Any use, dissemination, distribution, copying or disclosure of this message and any attachments, in whole or in part, by anyone other than the intended recipient(s) is strictly prohibited. This message has been checked with an antivirus software; accordingly, the sender is not liable for the presence of any virus in attachments that causes or may cause damage to the recipient's equipment or software.

Re: written accent

Posted by MilleBii <mi...@gmail.com>.
Yes, I had this problem too some time ago. What about adding a page to the
Nutch Wiki so that future users don't run into this issue again and again?

2009/9/2 Jair Piedrahita Vargas <JA...@bancolombia.com.co>

> I set URIEncoding="UTF-8" in Tomcat server.xml, and now I can search for
> words with written accents, although the "en caché" link still appears as
> "en cachÃf©"
>
> Regards,
>
> Jair Piedrahíta Vargas
> Estudiante en Práctica - Gerencia de Investigación y Nuevas Tecnologías
> Dirección de Estrategia y Arquitectura
> Vicepresidencia de Tecnología de Información
> BANCOLOMBIA S.A.
> www.bancolombia.com
> Tel: (++ 57) (4) 40 41 632
> Fax: (++ 57) (4) 40 40 197 - (++ 57) (4) 40 40 198
> E-mail: japiedra@bancolombia.com.co
> Cra. 48 # 26 - 85 Av. Los Industriales
> Torre Norte Piso 6B -  120 (Medellín, Colombia)
> ____________________________________________________
> Horario flexible: 7:00 - 12:00 y 1:30 - 4:30 GMT (-05:00)
>
>
> -----Original message-----
> From: Alexey Torochkov [mailto:all.net.ru@gmail.com]
> Sent: Wednesday, 2 September 2009, 08:43 a.m.
> To: nutch-user@lucene.apache.org
> Subject: Re: written accent
>
> Probably you should set URIEncoding="UTF-8" in Tomcat server.xml
>
> --
> Alexey Torochkov
>


-- 
-MilleBii-

RE: written accent

Posted by Jair Piedrahita Vargas <JA...@bancolombia.com.co>.
I set URIEncoding="UTF-8" in Tomcat server.xml, and now I can search for words with written accents, although the "en caché" link still appears as "en cachÃf©"
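
One plausible reading of the leftover "en cachÃf©" is that the label was encoded and wrongly decoded twice somewhere between the page source and the browser. That is an assumption, not something confirmed in this thread. The Python sketch below only shows what such double mis-decoding does to "é"; the helper name misdecode is made up for illustration.

```python
def misdecode(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

# One round of mis-decoding turns one accented letter into two characters.
once = misdecode("é")

# A second round (an assumption about what happened here) yields four characters.
twice = misdecode(once)

print(once)
print(twice)
```

If the second result resembles what the page shows, the label is probably being converted once more than the search terms are, so the page's own encoding declaration would be the next thing to check.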

Regards,

Jair Piedrahíta Vargas


-----Original message-----
From: Alexey Torochkov [mailto:all.net.ru@gmail.com]
Sent: Wednesday, 2 September 2009, 08:43 a.m.
To: nutch-user@lucene.apache.org
Subject: Re: written accent

Probably you should set URIEncoding="UTF-8" in Tomcat server.xml

--
Alexey Torochkov



RE: written accent

Posted by Jair Piedrahita Vargas <JA...@bancolombia.com.co>.
Thank you very much, that was the problem...

Regards,

Jair

-----Original message-----
From: Alexey Torochkov [mailto:all.net.ru@gmail.com]
Sent: Wednesday, 2 September 2009, 08:43 a.m.
To: nutch-user@lucene.apache.org
Subject: Re: written accent

Probably you should set URIEncoding="UTF-8" in Tomcat server.xml

--
Alexey Torochkov



Re: written accent

Posted by Alexey Torochkov <al...@gmail.com>.
Probably you should set URIEncoding="UTF-8" in Tomcat server.xml
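
As a rough illustration of why that setting matters (URIEncoding is an attribute of the Connector element in server.xml), the Python sketch below percent-encodes a search term as UTF-8, the way a browser typically submits it, and then decodes it two ways. Tomcat itself is not involved here; the sketch assumes the connector falls back to ISO-8859-1 for query strings when URIEncoding is not set, which was the default in Tomcat versions of that time.

```python
from urllib.parse import quote, unquote

# The term as typed in the search box.
term = "investigación"

# What a browser typically sends: the UTF-8 bytes, percent-encoded.
encoded = quote(term, encoding="utf-8")

# Assumed server default: each UTF-8 byte is read as one ISO-8859-1 character.
as_latin1 = unquote(encoded, encoding="iso-8859-1")

# With UTF-8 decoding, the original term comes back intact.
as_utf8 = unquote(encoded, encoding="utf-8")

print(encoded)
print(as_latin1)
print(as_utf8)
```

The ISO-8859-1 result has two characters where the accented letter was, so a query built from it would not match a term indexed with the proper accent.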

-- 
Alexey Torochkov

RE: written accent

Posted by Jair Piedrahita Vargas <JA...@bancolombia.com.co>.
Both. I think there are indexing problems, because there are pages that the crawl should reach but doesn't. There are search problems too: when I type a word with a written accent into the search box it looks fine, but when I click the search button the word in the box changes (the accented letter turns into other characters) and nothing is found.
For example, I type "investigación" and, when I click search, the word changes to "investigación".

What could be the problem?

Thanks

Regards,

Jair

-----Original message-----
From: MilleBii [mailto:millebii@gmail.com]
Sent: Wednesday, 2 September 2009, 01:46 a.m.
To: nutch-user@lucene.apache.org
Subject: Re: written accent

It works fine for me.
Do you mean indexing or searching?

2009/9/2 Jair Piedrahita Vargas <JA...@bancolombia.com.co>

> Hi everyone!
>
> I want to search an intranet whose pages are in Spanish, but I am having
> problems searching for pages that contain words with written accents.
> Apparently Nutch is not indexing those pages. What can I do to solve this
> problem?
>
> Thanks
>
> Regards,
>
> Jair Piedrahíta Vargas
>



--
-MilleBii-


Re: written accent

Posted by MilleBii <mi...@gmail.com>.
It works fine for me.
Do you mean indexing or searching?

2009/9/2 Jair Piedrahita Vargas <JA...@bancolombia.com.co>

> Hi everyone!
>
> I want to search an intranet whose pages are in Spanish, but I am having
> problems searching for pages that contain words with written accents.
> Apparently Nutch is not indexing those pages. What can I do to solve this
> problem?
>
> Thanks
>
> Regards,
>
> Jair Piedrahíta Vargas
>



-- 
-MilleBii-