Posted to dev@nutch.apache.org by sanjeev <sa...@hotmail.com> on 2006/11/16 07:56:05 UTC

implement thai language indexing and search

hi all,

I've been trying unsuccessfully for the past week to get the Thai language
analyzer working with Nutch. One thing I don't understand is why the Thai
analyzer belongs to the lucene.analysis package instead of the
nutch.analysis package.

I have the Thai NGP file, the analyzer (albeit from Lucene), and the Nutch
0.8 dev pack.

My question is how to integrate these into Nutch so that, when I index and
search, it analyzes and searches the Thai language correctly.

Could someone please help? I'm sure it can be done.

thanks and regards,
sanjeev
-- 
View this message in context: http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7372518
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: implement thai language indexing and search

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Wed, 2006-12-20 at 21:52 -0800, sanjeev wrote:
> Hello,
> 
> My crawl index is not being created correctly using the new settings.

https://issues.apache.org/jira/browse/SOLR-88

> Although the log shows no errors - I am not able to open using Luke,
> it says index corrupt, access denied, invalid index etc....
> what could be wrong ?

"Index corrupt" could be the issue above:

"Erik Hatcher [20/Dec/06 05:07 AM] 
Luke would very likely work if you used it with the Solr version of
Lucene, rather than modifying Solr's schema. lukeall JAR embeds Lucene,
but an older version. "

Read "Nutch" for "Solr" there.

I sometimes get "access denied" when I open the index with Luke on a
running instance. Try shutting down the server so that Luke is the only
application using the index.

HTH

salu2

>  Also the size of the index is rather small - 8Kb or
> so...:-(
> And no info in the logs about how many documents were indexed and all - the
> logfile
> pattern in 0.8.1 seems different from nutch 0.7.2 - am i right or wrong ?
> 
> please help as i'm going despo here ...
> 
> Thanks.
> sanjeev.

Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.

Hello,

My crawl index is not being created correctly with the new settings.
Although the log shows no errors, I am not able to open the index with
Luke; it reports "index corrupt", "access denied", "invalid index", etc.
What could be wrong? Also, the index is rather small, 8 KB or so. :-(
There is also no information in the logs about how many documents were
indexed. The log file pattern in 0.8.1 seems different from Nutch 0.7.2;
am I right or wrong?

Please help, as I'm getting desperate here ...

Thanks.
sanjeev.

Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.

Thanks a bunch, Shtykh.

After reading your tutorial, I understood how to wrap the Lucene ThaiAnalyzer.

I now have an analysis-th directory in nutch-0.8.1/plugins with a plugin.xml,
and I made the changes in nutch-site.xml.

From the Hadoop log file it appears that the language identifier has been
activated, and Thai (th) appears in the list of supported languages.

However, I am unable to open the index using Luke, so I have no way of knowing
whether Thai is being indexed correctly. Here are some excerpts from the
Hadoop log:


2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Registered Plugins:
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	CyberNeko HTML Parser (lib-nekohtml)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Site Query Filter (query-site)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Html Parse Plug-in (parse-html)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Regex URL Filter Framework (lib-regex-filter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Basic Indexing Filter (index-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Basic Summarizer Plug-in (summary-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Text Parse Plug-in (parse-text)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	JavaScript Parser (parse-js)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Regex URL Filter (urlfilter-regex)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Basic Query Filter (query-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	HTTP Framework (lib-http)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	URL Query Filter (query-url)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Http Protocol Plug-in (protocol-http)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	the nutch core extension points (nutch-extensionpoints)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	OPIC Scoring Plug-in (scoring-opic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Language Identification Parser/Filter (language-identifier)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Registered Extension-Points:
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Content Parser (org.apache.nutch.parse.Parser)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - 	Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2549-12-15 11:25:55,669 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2549-12-15 11:25:55,779 INFO  lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2549-12-15 11:25:56,544 INFO  lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
2549-12-15 11:25:56,544 INFO  indexer.IndexingFilters - Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
2549-12-15 11:25:57,091 INFO  indexer.Indexer - Optimizing index.
2549-12-15 11:25:57,544 INFO  indexer.Indexer - Indexer: done

And this is the crawl log:

Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112523
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112523
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawlxx3/segments/25491215112536
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112536
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112536
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawlxx3/linkdb
LinkDb: adding segment: crawlxx3/segments/25491215112523
LinkDb: adding segment: crawlxx3/segments/25491215112536
LinkDb: done
Indexer: starting
Indexer: linkdb: crawlxx3/linkdb
Indexer: adding segment: crawlxx3/segments/25491215112523
Indexer: adding segment: crawlxx3/segments/25491215112536
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawlxx3/indexes
Dedup: done
Adding crawlxx3/indexes/part-00000

Re: implement thai language indexing and search

Posted by Shtykh Roman <rs...@yahoo.com>.

Hi,

I recently dealt with Japanese support and wrote up how I did it at
http://nislab.human.waseda.ac.jp/blog/?page_id=7 . I think it will give
you some idea.

Br,
Roman


Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.

Hi all,

I am still waiting for some help with Thai language indexing and
searching.

Please help, as I'm quite lost on this one.

Thanks and regards,
sanjeev.



Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.

Thanks for clearing up some doubts. But how exactly do I wrap it?
Do I need to make code changes to use the new Thai tokenizer?
If so, which places need modification?
Do I need to download a dev version and recompile?

Please, if you could briefly tell me the steps, I would be much obliged.

Thanks,
sanjeev.





Re: implement thai language indexing and search

Posted by Jérôme Charron <je...@gmail.com>.

> i used an existing ThaiAnalyzer which was in lucene package.
> ok - i renamed the lucene.analysis.th.* to nutch.analysis.th.* - compiled
> and
> placed all class files in a jar - analysis-th.jar (do i need to bundle the
> ngp file in the jar as well ?)

1. You don't have to refactor the Lucene analyzer. Just wrap it, as I do
with the French and German analyzers (both use analyzers from Lucene).
2. The analyzer doesn't need NGP files... I think you misunderstood something:
2.1 On one side, there is the language identifier, which uses NGP files to
identify the language of a document.
2.2 On the other side, if a suitable analyzer is found for the identified
language, it is used to analyze the document.

Regards

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/
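
Editor's illustration of the "wrap, don't refactor" advice above. This is a
minimal sketch in plain Java, not Nutch or Lucene code: HostAnalyzer,
LibraryThaiAnalyzer, WrappedThaiAnalyzer, DefaultAnalyzer and AnalyzerRegistry
are hypothetical stand-ins. In a real plugin the wrapper would presumably
extend Nutch's NutchAnalyzer extension point and delegate to Lucene's
ThaiAnalyzer; java.text.BreakIterator is used here only as a stand-in word
segmenter so that the example runs without extra jars.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AnalyzerWrapSketch {

    /** Hypothetical stand-in for the host framework's analyzer extension point. */
    interface HostAnalyzer {
        List<String> tokens(String text);
    }

    /** Hypothetical stand-in for an existing analyzer that lives in another library. */
    static final class LibraryThaiAnalyzer {
        List<String> analyze(String text) {
            BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
            words.setText(text);
            List<String> out = new ArrayList<>();
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String token = text.substring(start, end).trim();
                if (!token.isEmpty()) {
                    out.add(token); // keep non-blank segments only
                }
            }
            return out;
        }
    }

    /** The wrapper: implements the host interface and delegates; the library class is untouched. */
    static final class WrappedThaiAnalyzer implements HostAnalyzer {
        private final LibraryThaiAnalyzer delegate = new LibraryThaiAnalyzer();

        @Override
        public List<String> tokens(String text) {
            return delegate.analyze(text);
        }
    }

    /** Fallback used when no language-specific analyzer is registered: whitespace split. */
    static final class DefaultAnalyzer implements HostAnalyzer {
        @Override
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** Picks an analyzer by identified language code, else the default one. */
    static final class AnalyzerRegistry {
        private final Map<String, HostAnalyzer> byLanguage = new HashMap<>();
        private final HostAnalyzer fallback = new DefaultAnalyzer();

        void register(String language, HostAnalyzer analyzer) {
            byLanguage.put(language, analyzer);
        }

        HostAnalyzer forLanguage(String language) {
            return byLanguage.getOrDefault(language, fallback);
        }
    }

    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry();
        registry.register("th", new WrappedThaiAnalyzer());

        // No analyzer registered for "en", so the default one is used.
        System.out.println(registry.forLanguage("en").tokens("nutch search"));
        // "th" is routed to the wrapper; segmentation depends on the JDK's locale data.
        System.out.println(registry.forLanguage("th").tokens("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22"));
    }
}
```

The point of the sketch is that the library class keeps its original package
and is never renamed or recompiled; only the thin wrapper belongs to the host
application.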

Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.

Thanks Jerome,

I used an existing ThaiAnalyzer from the Lucene package.

OK, I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled it, and
placed all the class files in a jar, analysis-th.jar (do I need to bundle the
NGP file in the jar as well?).

Take a look at the log file below for a sample crawl; somehow I feel the
language-identifier is still not activated.

I need your help urgently in resolving this issue.

Cheers and regards, and thanks for all your help.

sanjeev.
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-default.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/crawl-tool.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-site.xml
491116 151804 No FS indicated, using default:local
491116 151804 crawl started in: crawlnewxx2
491116 151804 rootUrlFile = urls
491116 151804 threads = 10
491116 151804 depth = 10
491116 151804 Created webdb at LocalFS,C:\cygwin\home\robert\nutch-0.7.2\crawlnewxx2\db
491116 151804 Starting URL processing
491116 151804 Plugins: looking in: C:\cygwin\home\robert\nutch-0.7.2\plugins
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\clustering-carrot2
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\creativecommons
491116 151804 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-more
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\ontology
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-ext
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-js
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-msword
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-pdf
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-rss
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-text\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-file
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-ftp
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-http\plugin.xml
491116 151805 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-httpclient
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-more
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-site\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-url\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-prefix
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
491116 151805 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
491116 151805 found resource regex-urlfilter.txt at file:/C:/cygwin/home/robert/nutch-0.7.2/conf/regex-urlfilter.txt
491116 151805 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer


Re: implement thai language indexing and search

Posted by Jérôme Charron <je...@gmail.com>.

> ok. I was able to enable the language identifier plugin by adding the
> value
> in plugin.includes attribute
> in nutch-site.xml - but i'm not sure just by doing that I can have thai
> text
> recognized and tokenized
> properly.
> What else do I have to do ? Please help me.

1. You must create a Thai NGP (n-gram profile) file so that the language
identifier can identify Thai.
2. You must create a Thai analyzer (see, for instance, the analysis-fr and
analysis-de sample analyzers).

Best Regards

Jérôme
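
Editor's illustration of step 1 above. The following is a simplified sketch of
how an n-gram profile can be used to identify a language; it is not Nutch's
actual language identifier or its NGP file format. The class
NgramLanguageSketch, its methods, the dot-product scoring and the tiny
training strings are all assumptions made for illustration. The n-gram lengths
1 to 4 are loosely based on the "[1-4/2048]" configuration line in the Hadoop
log posted elsewhere in this thread. Thai characters are written as \u escapes
so the source stays ASCII.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NgramLanguageSketch {
    static final int MIN_N = 1; // shortest n-gram, assumed
    static final int MAX_N = 4; // longest n-gram, assumed

    /** Builds a relative-frequency profile of character n-grams for the text. */
    static Map<String, Double> profile(String text) {
        Map<String, Double> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        double total = 0;
        for (int n = MIN_N; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1.0, Double::sum);
                total++;
            }
        }
        final double t = total;
        if (t > 0) {
            counts.replaceAll((gram, c) -> c / t); // normalize counts to frequencies
        }
        return counts;
    }

    /** Dot product of two profiles over the n-grams they share. */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double score = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                score += e.getValue() * other;
            }
        }
        return score;
    }

    /** Returns the best-scoring language code, or null if nothing overlaps. */
    static String identify(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> doc = profile(text);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = similarity(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();
        // Toy training samples; a real profile is built from a large corpus.
        profiles.put("en", profile("the quick brown fox searches the web index"));
        profiles.put("th", profile("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22")); // "phasa thai"

        System.out.println(identify("search the index", profiles));   // prints en
        System.out.println(identify("\u0E44\u0E17\u0E22", profiles)); // prints th
    }
}
```

The identified code ("th" here) is what would then be used to look up a
language-specific analyzer, which is step 2 and a separate component.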

Re: implement thai language indexing and search

Posted by sanjeev <sa...@hotmail.com>.


OK. I was able to enable the language identifier plugin by adding it to the
plugin.includes property in nutch-site.xml, but I'm not sure that this alone
is enough for Thai text to be recognized and tokenized properly.
What else do I have to do? Please help me.

Thanks and regards,
sanjeev.
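
Editor's illustration of the plugin.includes setting mentioned above. The
property value is a regular expression over plugin directory names. The sketch
below assumes, without having checked the Nutch source, that a plugin is
loaded only when its directory name matches the whole expression. The "before"
value is reconstructed from the plugins that the Nutch 0.7.2 log in this
thread shows as loaded, and the "after" value is one possible edit; neither is
copied from a real nutch-site.xml. PluginIncludesSketch and included() are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    /** Returns the plugin directory names that fully match the includes regex. */
    static List<String> included(String includesRegex, List<String> pluginDirs) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> out = new ArrayList<>();
        for (String dir : pluginDirs) {
            if (includes.matcher(dir).matches()) {
                out.add(dir);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
                "index-basic", "language-identifier", "analysis-th", "parse-html", "parse-pdf");

        // Assumed value, reconstructed from the plugins the 0.7.2 log reports as loaded.
        String before = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html)|index-basic|query-(basic|site|url)";
        // Assumed edit: add the language identifier and the new Thai analysis plugin.
        String after = before + "|language-identifier|analysis-th";

        System.out.println(included(before, dirs)); // [index-basic, parse-html]
        System.out.println(included(after, dirs));  // [index-basic, language-identifier, analysis-th, parse-html]
    }
}
```

Under that assumption, a new plugin directory such as analysis-th stays
inactive until the expression is extended to cover its name, even though
language-identifier is already enabled.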



