Posted to user@nutch.apache.org by suryas <st...@yahoo.com> on 2009/01/25 23:31:17 UTC

How to index and search Indian languages

Hi,
I want to index & search Tamil (an Indian language) pages using Nutch. I
have some knowledge of Lucene and just got the "Nutch Basic Tutorial"
working.

Where should I look for information on indexing Tamil or other Indian
language pages?

I'm looking for:
* step-by-step documentation for indexing and searching foreign language
  pages, particularly Indian languages
* some examples, samples, or tutorials

Or if you could just point me in the right direction, that'll be fine too.

I saw some postings from "saran" & "saravana kumar" about this same
thing. Guys, did you figure this out? Could you please help?

Could someone else help?

thanks,
Surya 
-- 
View this message in context: http://www.nabble.com/How-to-index-and-search-Indian-languages-tp21657719p21657719.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How to index and search Indian languages

Posted by suryas <st...@yahoo.com>.
Thanks Vishal. 

I had it working last night, with exactly the steps you had suggested. 

I was using CentOS, Tomcat, and Nutch 0.9. After I got the basic tutorial
working I tried crawling a Tamil site. The crawl worked OK, but I
couldn't search. That's where I needed help.

Here are the small roadblocks I had to get past:
1. Tamil fonts were not installed on my CentOS machine, so I couldn't search
   from the command line or the browser.
   "yum install tamil-fonts" fixed that problem.

2. After installing the fonts, I was able to search from the command line.

3. To get Tamil search working from the Nutch WAR application, I had to set
   the URI encoding in the Tomcat Connector:
   http://wiki.apache.org/nutch/GettingNutchRunningWithUtf8

Now the basics are working.
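
For anyone hitting roadblock 3, here is a minimal Python sketch of why the
URI encoding matters. It is not Nutch or Tomcat code. It assumes the browser
percent-encodes the query as UTF-8 and that the server decodes it as
ISO-8859-1 unless told otherwise (to my knowledge that was the default in the
Tomcat versions of that time, and the Connector's URIEncoding attribute is
what changes it). The sample query and the function name decode_query are
made up for illustration.

```python
from urllib.parse import quote, unquote

# Illustrative Tamil query term (the word "Tamil" in Tamil script).
QUERY = "தமிழ்"


def decode_query(encoded: str, charset: str) -> str:
    """Decode a percent-encoded query the way a server using `charset` would."""
    return unquote(encoded, encoding=charset)


# What the browser puts in the URL: UTF-8 bytes, percent-encoded.
encoded = quote(QUERY, encoding="utf-8")

# Server decoding with the wrong charset: every byte becomes its own character,
# so the search term no longer matches anything in the index.
wrong = decode_query(encoded, "iso-8859-1")

# Server decoding with UTF-8: the original term comes back intact.
right = decode_query(encoded, "utf-8")

print("sent in URL      :", encoded)
print("decoded as UTF-8 :", right, right == QUERY)
print("decoded as Latin1:", repr(wrong), wrong == QUERY)
```

With the wrong charset the five Tamil characters turn into fifteen unrelated
Latin-1 characters, which matches the symptom above: the crawl and the
command-line search work, but searching through the web application finds
nothing.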

Thanks Vishal and Saran for the help. 

-Surya


vishal vachhani wrote:
> 
> If your documents are in UTF-8, you can crawl and index them directly
> using Nutch (the tutorial is on the Nutch wiki). Otherwise, you need to
> first convert the documents to UTF-8 and then index them. After indexing
> is done, first try searching with Nutch's command-line search API, and
> then modify the GUI (the Nutch JSP page) so that you can also search from
> the GUI.
> To verify your index you can also use Luke, the Lucene index tool.



Re: How to index and search Indian languages

Posted by vishal vachhani <vi...@gmail.com>.
If your documents are in UTF-8, you can crawl and index them directly
using Nutch (the tutorial is on the Nutch wiki). Otherwise, you need to
first convert the documents to UTF-8 and then index them. After indexing
is done, first try searching with Nutch's command-line search API, and
then modify the GUI (the Nutch JSP page) so that you can also search from
the GUI.
To verify your index you can also use Luke, the Lucene index tool.
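
The conversion step could look roughly like the Python sketch below. It is
only an illustration under stated assumptions: the helper names to_utf8 and
convert_file are hypothetical, and it assumes you already know the source
encoding and that Python ships a codec for it. As far as I know, legacy Tamil
font encodings such as TSCII are not among Python's standard codecs, so those
would need a separate mapping table. UTF-16 stands in for the source encoding
here.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src in source_encoding and write the same text to dst as UTF-8."""
    dst.write_bytes(to_utf8(src.read_bytes(), source_encoding))


if __name__ == "__main__":
    # Stand-in for a page that was saved as UTF-16 rather than UTF-8.
    original_text = "தமிழ்"
    utf16_bytes = original_text.encode("utf-16")

    utf8_bytes = to_utf8(utf16_bytes, "utf-16")
    print(utf8_bytes.decode("utf-8") == original_text)
```

For HTML pages, any charset declared in the page itself would also need to
say UTF-8 after the conversion, otherwise the parser may still misread the
text.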


