Posted to dev@nutch.apache.org by Andy Liu <an...@gmail.com> on 2005/08/01 18:25:50 UTC

Detecting CJKV / Asian language pages

The current Nutch language identifier plugin doesn't handle CJKV
pages.  Does anybody here have any experience with automatically
detecting the language of such pages?

I know there are specific encodings that give away what language a
page is in, but for Asian-language pages that use Unicode or its
variants, I'm out of luck.

Andy

Re: nutch prune

Posted by Matthias Jaekle <ja...@eventax.de>.
Hi Jay,
I think with the current version you can only prune segments.
We once wrote a class to prune the db.
Maybe you could use this and add a function to delete pages according to 
the urlfilter. I have attached our class.
Matthias
-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - The search engine for local content and events


Jay Pound schrieb:

> How do I write my queries file for pruning my database, to only .com .edu
> .org .us etc... only us sites?
> Thanks,
> Jay Pound

nutch prune

Posted by Jay Pound <we...@poundwebhosting.com>.
How do I write my queries file for pruning my database down to only .com,
.edu, .org, .us, etc. (US sites only)?
Thanks,
Jay Pound



RE: Memory usage2

Posted by EM <em...@cpuedge.com>.
Why isn't 'analyze' supported anymore?

-----Original Message-----
From: Andy Liu [mailto:andyliu1227@gmail.com] 
Sent: Tuesday, August 02, 2005 5:44 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Memory usage2

I have found that merging indexes does help performance significantly.

If you're not using the cached pages for anything, I believe you can
delete the /content directory for each segment and the engine should
work fine (test before you try for real!)  However, if you ever have
to reindex the segments for whatever reason, you'll run into problems
without the /content dirs.

Nutch doesn't use the HITS algorithm.  Nutch's analyze phase was based
on PageRank, but it's no longer supported.  By default Nutch
boosts documents based on the # of incoming links, which works well in
small document collections, but is not a robust method in a whole-web
environment.  In terms of search quality, Nutch would not be able to
hang with the "big dogs" of search just yet.  There's still much work
that needs to be done in the areas of search quality and spam handling.

Andy


Re: Memory usage2

Posted by Andy Liu <an...@gmail.com>.
I have found that merging indexes does help performance significantly.

If you're not using the cached pages for anything, I believe you can
delete the /content directory for each segment and the engine should
work fine (test before you try for real!)  However, if you ever have
to reindex the segments for whatever reason, you'll run into problems
without the /content dirs.

Nutch doesn't use the HITS algorithm.  Nutch's analyze phase was based
on PageRank, but it's no longer supported.  By default Nutch
boosts documents based on the # of incoming links, which works well in
small document collections, but is not a robust method in a whole-web
environment.  In terms of search quality, Nutch would not be able to
hang with the "big dogs" of search just yet.  There's still much work
that needs to be done in the areas of search quality and spam handling.

Andy

On 8/2/05, Fredrik Andersson <fi...@gmail.com> wrote:
> Hi Jay!
> 
> Why not use the "Google approach" and buy lots of cheap
> workstations/servers to distribute the search on? You can really get
> away cheap these days, compared to high-end servers. Even if NDFS and
> isn't fully up to par in 0.7-dev yet, you can still move your indices
> around to separate computers and distribute them that way.  Writing a
> small client/server for this purpose can be done in a matter of hours.
> Gathering as much data as you have on one server sounds like a bad
> idea to me, no matter how monstrous that server is.
> 
> Regarding the HITS algorithm - check out the example on the Nutch
> website for the Internet crawl, where you select the top scorers after
> you finished a segment (of arbitrary size), and continue on crawling
> from those high-ranking sites. That way you will get the most
> authorative sites in your index first, which is good.
> 
> Good night,
> Fredrik
> 
> On 8/2/05, Jay Pound <we...@poundwebhosting.com> wrote:
> > ....
> > one last important question, if I merge my indexes will it search faster
> > than if I don't merge them, I currently have 20 directories of 1-1.7mill
> > pages each.
> > and if I split up these indexes across multiple machines will the searching
> > be faster, I couldent get the nutch-server to work but I'm using 0.6.
> > ...
> > Thank you
> > -Jay Pound
> > Fromped.com
> > BTW windows 2000 is not 100% stable with dual core processors. nutch is ok
> > but cant do too many things at once or I'll get a kernel inpage error (guess
> > its time to migrate to 2003.net server-damn)
> > ----- Original Message -----
> > From: "Doug Cutting" <cu...@nutch.org>
> > To: <nu...@lucene.apache.org>
> > Sent: Tuesday, August 02, 2005 1:53 PM
> > Subject: Re: Memory usage
> >
> >
> > > Try the following settings in your nutch-site.xml:
> > >
> > > <property>
> > >    <name>io.map.index.skip</name>
> > >    <value>7</value>
> > > </property>
> > >
> > > <property>
> > >    <name>indexer.termIndexInterval</name>
> > >    <value>1024</value>
> > > </property>
> > >
> > > The first causes data files to use considerably less memory.
> > >
> > > The second affects index creation, so must be done before you create the
> > > index you search.  It's okay if your segment indexes were created
> > > without this, you can just (re-)merge indexes and the merged index will
> > > get the setting and use less memory when searching.
> > >
> > > Combining these two I have searched a 40+M page index on a machine using
> > > about 500MB of RAM.  That said, search times with such a large index are
> > > not good.  At some point, as your collection grows, you will want to
> > > merge multiple indexes containing different subsets of segments and put
> > > each on a separate box and search them with distributed search.
> > >
> > > Doug
> > >
> > > Jay Pound wrote:
> > > > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
> > search
> > > > using tomcat 5, I plan on having an index with multiple billion pages,
> > but
> > > > if this is to scale then even with 16GB of ram I wont be able to have an
> > > > index larger than 320million pages? how can I distribute the memory
> > > > requirements across multiple machines, or is there another servlet
> > program
> > > > (like resin) that will require less memory to operate, has anyone else
> > run
> > > > into this?
> > > > Thanks,
> > > > -Jay Pound
> > > >
> > > >
> > >
> > >
> >
> >
> >
>

Re: Memory usage2

Posted by Fredrik Andersson <fi...@gmail.com>.
Hi Jay!

Why not use the "Google approach" and buy lots of cheap
workstations/servers to distribute the search on? You can really get
away cheap these days, compared to high-end servers. Even if NDFS
isn't fully up to par in 0.7-dev yet, you can still move your indices
around to separate computers and distribute them that way.  Writing a
small client/server for this purpose can be done in a matter of hours.
Gathering as much data as you have on one server sounds like a bad
idea to me, no matter how monstrous that server is.

Regarding the HITS algorithm - check out the example on the Nutch
website for the Internet crawl, where you select the top scorers after
you finished a segment (of arbitrary size), and continue on crawling
from those high-ranking sites. That way you will get the most
authoritative sites in your index first, which is good.

Good night,
Fredrik

On 8/2/05, Jay Pound <we...@poundwebhosting.com> wrote:
> ....
> one last important question, if I merge my indexes will it search faster
> than if I don't merge them, I currently have 20 directories of 1-1.7mill
> pages each.
> and if I split up these indexes across multiple machines will the searching
> be faster, I couldent get the nutch-server to work but I'm using 0.6.
> ...
> Thank you
> -Jay Pound
> Fromped.com
> BTW windows 2000 is not 100% stable with dual core processors. nutch is ok
> but cant do too many things at once or I'll get a kernel inpage error (guess
> its time to migrate to 2003.net server-damn)
> ----- Original Message -----
> From: "Doug Cutting" <cu...@nutch.org>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
> 
> 
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >    <name>io.map.index.skip</name>
> >    <value>7</value>
> > </property>
> >
> > <property>
> >    <name>indexer.termIndexInterval</name>
> >    <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so must be done before you create the
> > index you search.  It's okay if your segment indexes were created
> > without this, you can just (re-)merge indexes and the merged index will
> > get the setting and use less memory when searching.
> >
> > Combining these two I have searched a 40+M page index on a machine using
> > about 500MB of RAM.  That said, search times with such a large index are
> > not good.  At some point, as your collection grows, you will want to
> > merge multiple indexes containing different subsets of segments and put
> > each on a separate box and search them with distributed search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
> search
> > > using tomcat 5, I plan on having an index with multiple billion pages,
> but
> > > if this is to scale then even with 16GB of ram I wont be able to have an
> > > index larger than 320million pages? how can I distribute the memory
> > > requirements across multiple machines, or is there another servlet
> program
> > > (like resin) that will require less memory to operate, has anyone else
> run
> > > into this?
> > > Thanks,
> > > -Jay Pound
> > >
> > >
> >
> >
> 
> 
>

Re: distributed search

Posted by Jay Pound <we...@poundwebhosting.com>.
Thank you, Piotr, and sorry for misspelling your name in the long e-mail.
-J
----- Original Message ----- 
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Friday, August 05, 2005 8:43 AM
Subject: Re: distributed search


> If you have two search servers
> search1.mydomain.com
> search2.mydomain.com
> Then on each of them run
> ./bin/nutch server 1234 /index
>
> Now go to your tomcat box. In the directory where you used to have
> "segments" dir
> (either tomcat startup directory or directory specified in nutch config
xml).
> Create "search-servers.txt" file containing:
> search1.mydomain.com 1234
> search2.mydomain.com 1234
>
> And move your old segment/index directories somewhere else so they are
> not by accident used.
> You should see search activity in your search servers logs now.
> Regards
> Piotr
>
>
>
>
>
> On 8/2/05, webmaster <sa...@www.poundwebhosting.com> wrote:
> > I read the wiki on the server option, how does it talk with tomcat for
the
> > search? it says
> > ./bin/nutch server port index dir
> > ./bin/nutch server 1234 /index
> >
> > how does it talk with eachother to find the other servers in the
cluster?
> > -Jay
> >
>
>



Re: distributed search

Posted by Piotr Kosiorowski <pk...@gmail.com>.
If you have two search servers
search1.mydomain.com
search2.mydomain.com
Then on each of them run
./bin/nutch server 1234 /index

Now go to your tomcat box. In the directory where you used to have
"segments" dir
(either tomcat startup directory or directory specified in nutch config xml).
Create "search-servers.txt" file containing:
search1.mydomain.com 1234
search2.mydomain.com 1234

And move your old segment/index directories somewhere else so they are
not used by accident.
You should see search activity in your search servers' logs now.
Regards
Piotr





On 8/2/05, webmaster <sa...@www.poundwebhosting.com> wrote:
> I read the wiki on the server option, how does it talk with tomcat for the
> search? it says
> ./bin/nutch server port index dir
> ./bin/nutch server 1234 /index
> 
> how does it talk with eachother to find the other servers in the cluster?
> -Jay
>

distributed search

Posted by webmaster <sa...@www.poundwebhosting.com>.
I read the wiki on the server option. How does it talk with Tomcat for the 
search? It says:
./bin/nutch server <port> <index dir>
./bin/nutch server 1234 /index

How do the servers talk with each other to find the others in the cluster?
-Jay

Re: [Nutch-general] Re: Memory usage2

Posted by Jay Pound <we...@poundwebhosting.com>.
This is going to be a web-wide search engine. I just want to be able to set
it up for each language; right now it returns results for all languages, so
the results are not so good.
I'm trying to get pruning to work but don't know how; then I'll make a
smaller index for each language out of a larger index containing all
languages.
-J
----- Original Message ----- 
From: "Sébastien LE CALLONNEC" <sl...@yahoo.ie>
To: <nu...@lucene.apache.org>
Sent: Tuesday, August 02, 2005 4:34 PM
Subject: Re: [Nutch-general] Re: Memory usage2


> Obviously not:  it must be for « [urls] just ending in US
> extensions(.com.edu etc...) ». :))
>
> Anyway, it all sounds very impressive!  Good luck with your
> investigations and please keep us posted.
>
>
> Regards,
> Sébastien.
>
>
> --- ogjunk-nutch@yahoo.com wrote:
>
> > Wow, a pile of questions. :)
> > Is this for a web-wide search engine?
> >
> > Otis
> >
> >
> > --- Jay Pound <we...@poundwebhosting.com> wrote:
> >
> > > whats the bottleneck for the slow searching, I'm monitoring it and
> > > its doing
> > > about 57% cpu load when I'm searching , it takes about 50secs to
> > > bring up
> > > the results page the first time, then if I search for the same
> > thing
> > > again
> > > its much faster.
> > > Doug, can I trash my segments after they are indexed, I don't want
> > to
> > > have
> > > cached access to the pages do the segments still need to be there?
> > my
> > > 30mil
> > > page index/segment is using over 300gb I have the space, but when I
> > > get to
> > > the hundreds of millions of pages I will run out of room on my raid
> > > controler's for hd expansion, I'm planning on moving to lustre if
> > > ndfs is
> > > not stable by then. I plan on having a multi billion page index if
> > > the
> > > memory requirements for that can be below 16gb per search node.
> > right
> > > now
> > > I'm getting pretty crappy results from my 30 million pages, I read
> > > the
> > > whitepaper on Authoritative Sources in a Hyperlinked Environment
> > > because
> > > someone said thats how the nutch algorithm worked, so I'm assuming
> > as
> > > my
> > > index grows the pages that deserve top placement will recieve top
> > > placement,
> > > but I don't know if I should re-fetch a new set of segments with
> > root
> > > url's
> > > just ending in US extensions(.com.edu etc...) I made a small set
> > > testing
> > > this theory (100000 pages) and its results were much better than my
> > > results
> > > from the 30mill page index. whats your thought on this, am I right
> > in
> > > thinking that the pages with the most pages linking to them will
> > show
> > > up
> > > first? so if I index 500 million pages my results should be on par
> > > with the
> > > rest of the "big dogs"?
> > >
> > > one last important question, if I merge my indexes will it search
> > > faster
> > > than if I don't merge them, I currently have 20 directories of
> > > 1-1.7mill
> > > pages each.
> > > and if I split up these indexes across multiple machines will the
> > > searching
> > > be faster, I couldent get the nutch-server to work but I'm using
> > 0.6.
> > >
> > > I have a very fast server I didnt know if the searching would take
> > > advantage
> > > of smp, fetching will and I can run multiple index's at the same
> > > time. my HD
> > > array is 200MB a sec i/o I have the new dual core opteron 275 italy
> > > core
> > > with 4gb ram, working my way to 16gb when I need it and a second
> > > processor
> > > when I need it, 1.28TB of hd space for nutch currently with
> > expansion
> > > up to
> > > 5.12TB, I'm currently running windows 2000 on it as they havent
> > made
> > > a
> > > driver yet for suse 9.3 for my raid cards (highpoint 2220) so my
> > > scalability
> > > will be to 960MB a sec with all the drives in the system and 4x2.2
> > > Ghz
> > > processor cores. untill I need to cluster thats what I have to play
> > > with for
> > > nutch.
> > > in case you guys needed to know what hardware I'm running
> > > Thank you
> > > -Jay Pound
> > > Fromped.com
> > > BTW windows 2000 is not 100% stable with dual core processors.
> > nutch
> > > is ok
> > > but cant do too many things at once or I'll get a kernel inpage
> > error
> > > (guess
> > > its time to migrate to 2003.net server-damn)
> > > ----- Original Message ----- 
> > > From: "Doug Cutting" <cu...@nutch.org>
> > > To: <nu...@lucene.apache.org>
> > > Sent: Tuesday, August 02, 2005 1:53 PM
> > > Subject: Re: Memory usage
> > >
> > >
> > > > Try the following settings in your nutch-site.xml:
> > > >
> > > > <property>
> > > >    <name>io.map.index.skip</name>
> > > >    <value>7</value>
> > > > </property>
> > > >
> > > > <property>
> > > >    <name>indexer.termIndexInterval</name>
> > > >    <value>1024</value>
> > > > </property>
> > > >
> > > > The first causes data files to use considerably less memory.
> > > >
> > > > The second affects index creation, so must be done before you
> > > create the
> > > > index you search.  It's okay if your segment indexes were created
> > > > without this, you can just (re-)merge indexes and the merged
> > index
> > > will
> > > > get the setting and use less memory when searching.
> > > >
> > > > Combining these two I have searched a 40+M page index on a
> > machine
> > > using
> > > > about 500MB of RAM.  That said, search times with such a large
> > > index are
> > > > not good.  At some point, as your collection grows, you will want
> > > to
> > > > merge multiple indexes containing different subsets of segments
> > and
> > > put
> > > > each on a separate box and search them with distributed search.
> > > >
> > > > Doug
> > > >
> > > > Jay Pound wrote:
> > > > > I'm testing an index of 30 million pages, it requires 1.5gb of
> > > ram to
> > > search
> > > > > using tomcat 5, I plan on having an index with multiple billion
> > > pages,
> > > but
> > > > > if this is to scale then even with 16GB of ram I wont be able
> > to
> > > have an
> > > > > index larger than 320million pages? how can I distribute the
> > > memory
> > > > > requirements across multiple machines, or is there another
> > > servlet
> > > program
> > > > > (like resin) that will require less memory to operate, has
> > anyone
> > > else
> > > run
> > > > > into this?
> > > > > Thanks,
> > > > > -Jay Pound
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>
>
>
>
>
>
>
>
>



Re: [Nutch-general] Re: Memory usage2

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Obviously not:  it must be for « [urls] just ending in US
extensions(.com.edu etc...) ». :))

Anyway, it all sounds very impressive!  Good luck with your
investigations and please keep us posted.


Regards,
Sébastien.


--- ogjunk-nutch@yahoo.com wrote:

> Wow, a pile of questions. :)
> Is this for a web-wide search engine?
> 
> Otis
> 
> 
> --- Jay Pound <we...@poundwebhosting.com> wrote:
> 
> > whats the bottleneck for the slow searching, I'm monitoring it and
> > its doing
> > about 57% cpu load when I'm searching , it takes about 50secs to
> > bring up
> > the results page the first time, then if I search for the same
> thing
> > again
> > its much faster.
> > Doug, can I trash my segments after they are indexed, I don't want
> to
> > have
> > cached access to the pages do the segments still need to be there?
> my
> > 30mil
> > page index/segment is using over 300gb I have the space, but when I
> > get to
> > the hundreds of millions of pages I will run out of room on my raid
> > controler's for hd expansion, I'm planning on moving to lustre if
> > ndfs is
> > not stable by then. I plan on having a multi billion page index if
> > the
> > memory requirements for that can be below 16gb per search node.
> right
> > now
> > I'm getting pretty crappy results from my 30 million pages, I read
> > the
> > whitepaper on Authoritative Sources in a Hyperlinked Environment
> > because
> > someone said thats how the nutch algorithm worked, so I'm assuming
> as
> > my
> > index grows the pages that deserve top placement will recieve top
> > placement,
> > but I don't know if I should re-fetch a new set of segments with
> root
> > url's
> > just ending in US extensions(.com.edu etc...) I made a small set
> > testing
> > this theory (100000 pages) and its results were much better than my
> > results
> > from the 30mill page index. whats your thought on this, am I right
> in
> > thinking that the pages with the most pages linking to them will
> show
> > up
> > first? so if I index 500 million pages my results should be on par
> > with the
> > rest of the "big dogs"?
> > 
> > one last important question, if I merge my indexes will it search
> > faster
> > than if I don't merge them, I currently have 20 directories of
> > 1-1.7mill
> > pages each.
> > and if I split up these indexes across multiple machines will the
> > searching
> > be faster, I couldent get the nutch-server to work but I'm using
> 0.6.
> > 
> > I have a very fast server I didnt know if the searching would take
> > advantage
> > of smp, fetching will and I can run multiple index's at the same
> > time. my HD
> > array is 200MB a sec i/o I have the new dual core opteron 275 italy
> > core
> > with 4gb ram, working my way to 16gb when I need it and a second
> > processor
> > when I need it, 1.28TB of hd space for nutch currently with
> expansion
> > up to
> > 5.12TB, I'm currently running windows 2000 on it as they havent
> made
> > a
> > driver yet for suse 9.3 for my raid cards (highpoint 2220) so my
> > scalability
> > will be to 960MB a sec with all the drives in the system and 4x2.2
> > Ghz
> > processor cores. untill I need to cluster thats what I have to play
> > with for
> > nutch.
> > in case you guys needed to know what hardware I'm running
> > Thank you
> > -Jay Pound
> > Fromped.com
> > BTW windows 2000 is not 100% stable with dual core processors.
> nutch
> > is ok
> > but cant do too many things at once or I'll get a kernel inpage
> error
> > (guess
> > its time to migrate to 2003.net server-damn)
> > ----- Original Message ----- 
> > From: "Doug Cutting" <cu...@nutch.org>
> > To: <nu...@lucene.apache.org>
> > Sent: Tuesday, August 02, 2005 1:53 PM
> > Subject: Re: Memory usage
> > 
> > 
> > > Try the following settings in your nutch-site.xml:
> > >
> > > <property>
> > >    <name>io.map.index.skip</name>
> > >    <value>7</value>
> > > </property>
> > >
> > > <property>
> > >    <name>indexer.termIndexInterval</name>
> > >    <value>1024</value>
> > > </property>
> > >
> > > The first causes data files to use considerably less memory.
> > >
> > > The second affects index creation, so must be done before you
> > create the
> > > index you search.  It's okay if your segment indexes were created
> > > without this, you can just (re-)merge indexes and the merged
> index
> > will
> > > get the setting and use less memory when searching.
> > >
> > > Combining these two I have searched a 40+M page index on a
> machine
> > using
> > > about 500MB of RAM.  That said, search times with such a large
> > index are
> > > not good.  At some point, as your collection grows, you will want
> > to
> > > merge multiple indexes containing different subsets of segments
> and
> > put
> > > each on a separate box and search them with distributed search.
> > >
> > > Doug
> > >
> > > Jay Pound wrote:
> > > > I'm testing an index of 30 million pages, it requires 1.5gb of
> > ram to
> > search
> > > > using tomcat 5, I plan on having an index with multiple billion
> > pages,
> > but
> > > > if this is to scale then even with 16GB of ram I wont be able
> to
> > have an
> > > > index larger than 320million pages? how can I distribute the
> > memory
> > > > requirements across multiple machines, or is there another
> > servlet
> > program
> > > > (like resin) that will require less memory to operate, has
> anyone
> > else
> > run
> > > > into this?
> > > > Thanks,
> > > > -Jay Pound
> > > >
> > > >
> > >
> > >
> > 
> > 
> > 
> > 
> > 
> 
> 



	

	
		

Re: [Nutch-general] Re: Memory usage2

Posted by og...@yahoo.com.
Wow, a pile of questions. :)
Is this for a web-wide search engine?

Otis


--- Jay Pound <we...@poundwebhosting.com> wrote:

> whats the bottleneck for the slow searching, I'm monitoring it and
> its doing
> about 57% cpu load when I'm searching , it takes about 50secs to
> bring up
> the results page the first time, then if I search for the same thing
> again
> its much faster.
> Doug, can I trash my segments after they are indexed, I don't want to
> have
> cached access to the pages do the segments still need to be there? my
> 30mil
> page index/segment is using over 300gb I have the space, but when I
> get to
> the hundreds of millions of pages I will run out of room on my raid
> controler's for hd expansion, I'm planning on moving to lustre if
> ndfs is
> not stable by then. I plan on having a multi billion page index if
> the
> memory requirements for that can be below 16gb per search node. right
> now
> I'm getting pretty crappy results from my 30 million pages, I read
> the
> whitepaper on Authoritative Sources in a Hyperlinked Environment
> because
> someone said thats how the nutch algorithm worked, so I'm assuming as
> my
> index grows the pages that deserve top placement will recieve top
> placement,
> but I don't know if I should re-fetch a new set of segments with root
> url's
> just ending in US extensions(.com.edu etc...) I made a small set
> testing
> this theory (100000 pages) and its results were much better than my
> results
> from the 30mill page index. whats your thought on this, am I right in
> thinking that the pages with the most pages linking to them will show
> up
> first? so if I index 500 million pages my results should be on par
> with the
> rest of the "big dogs"?
> 
> one last important question, if I merge my indexes will it search
> faster
> than if I don't merge them, I currently have 20 directories of
> 1-1.7mill
> pages each.
> and if I split up these indexes across multiple machines will the
> searching
> be faster, I couldent get the nutch-server to work but I'm using 0.6.
> 
> I have a very fast server I didnt know if the searching would take
> advantage
> of smp, fetching will and I can run multiple index's at the same
> time. my HD
> array is 200MB a sec i/o I have the new dual core opteron 275 italy
> core
> with 4gb ram, working my way to 16gb when I need it and a second
> processor
> when I need it, 1.28TB of hd space for nutch currently with expansion
> up to
> 5.12TB, I'm currently running windows 2000 on it as they havent made
> a
> driver yet for suse 9.3 for my raid cards (highpoint 2220) so my
> scalability
> will be to 960MB a sec with all the drives in the system and 4x2.2
> Ghz
> processor cores. untill I need to cluster thats what I have to play
> with for
> nutch.
> in case you guys needed to know what hardware I'm running
> Thank you
> -Jay Pound
> Fromped.com
> BTW windows 2000 is not 100% stable with dual core processors. nutch
> is ok
> but cant do too many things at once or I'll get a kernel inpage error
> (guess
> its time to migrate to 2003.net server-damn)
> ----- Original Message ----- 
> From: "Doug Cutting" <cu...@nutch.org>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
> 
> 
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >    <name>io.map.index.skip</name>
> >    <value>7</value>
> > </property>
> >
> > <property>
> >    <name>indexer.termIndexInterval</name>
> >    <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so must be done before you
> create the
> > index you search.  It's okay if your segment indexes were created
> > without this, you can just (re-)merge indexes and the merged index
> will
> > get the setting and use less memory when searching.
> >
> > Combining these two I have searched a 40+M page index on a machine
> using
> > about 500MB of RAM.  That said, search times with such a large
> index are
> > not good.  At some point, as your collection grows, you will want
> to
> > merge multiple indexes containing different subsets of segments and
> put
> > each on a separate box and search them with distributed search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages, it requires 1.5gb of
> ram to
> search
> > > using tomcat 5, I plan on having an index with multiple billion
> pages,
> but
> > > if this is to scale then even with 16GB of ram I wont be able to
> have an
> > > index larger than 320million pages? how can I distribute the
> memory
> > > requirements across multiple machines, or is there another
> servlet
> program
> > > (like resin) that will require less memory to operate, has anyone
> else
> run
> > > into this?
> > > Thanks,
> > > -Jay Pound
> > >
> > >
> >
> >
> 
> 
> 
> 
> 


RE: Memory usage2

Posted by Paul Harrison <pa...@personifi.com>.
I am very interested in the answer to this as we have crawled 100+ million
pages on 5 very fast machines and are seeing similar issues.  At 7 million
pages results return in about 3 to 4 seconds.  On 20 million pages results
were coming back in about 15 seconds.  The funny thing is that the result
times do not scale linearly.  Our current theory is that there is a problem
in the way Lucene and Java interact.  I would love to hear from other folks
on this.

Thanks,

Paul Harrison

-----Original Message-----
From: Jay Pound [mailto:webmaster@poundwebhosting.com] 
Sent: Tuesday, August 02, 2005 2:44 PM
To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
Subject: Re: Memory usage2

whats the bottleneck for the slow searching, I'm monitoring it and its doing
about 57% cpu load when I'm searching , it takes about 50secs to bring up
the results page the first time, then if I search for the same thing again
its much faster.
Doug, can I trash my segments after they are indexed, I don't want to have
cached access to the pages do the segments still need to be there? my 30mil
page index/segment is using over 300gb I have the space, but when I get to
the hundreds of millions of pages I will run out of room on my raid
controler's for hd expansion, I'm planning on moving to lustre if ndfs is
not stable by then. I plan on having a multi billion page index if the
memory requirements for that can be below 16gb per search node. right now
I'm getting pretty crappy results from my 30 million pages, I read the
whitepaper on Authoritative Sources in a Hyperlinked Environment because
someone said thats how the nutch algorithm worked, so I'm assuming as my
index grows the pages that deserve top placement will recieve top placement,
but I don't know if I should re-fetch a new set of segments with root url's
just ending in US extensions(.com.edu etc...) I made a small set testing
this theory (100000 pages) and its results were much better than my results
from the 30mill page index. whats your thought on this, am I right in
thinking that the pages with the most pages linking to them will show up
first? so if I index 500 million pages my results should be on par with the
rest of the "big dogs"?

one last important question, if I merge my indexes will it search faster
than if I don't merge them, I currently have 20 directories of 1-1.7mill
pages each.
and if I split up these indexes across multiple machines will the searching
be faster, I couldent get the nutch-server to work but I'm using 0.6.

I have a very fast server I didnt know if the searching would take advantage
of smp, fetching will and I can run multiple index's at the same time. my HD
array is 200MB a sec i/o I have the new dual core opteron 275 italy core
with 4gb ram, working my way to 16gb when I need it and a second processor
when I need it, 1.28TB of hd space for nutch currently with expansion up to
5.12TB, I'm currently running windows 2000 on it as they havent made a
driver yet for suse 9.3 for my raid cards (highpoint 2220) so my scalability
will be to 960MB a sec with all the drives in the system and 4x2.2 Ghz
processor cores. untill I need to cluster thats what I have to play with for
nutch.
in case you guys needed to know what hardware I'm running
Thank you
-Jay Pound
Fromped.com
BTW windows 2000 is not 100% stable with dual core processors. nutch is ok
but cant do too many things at once or I'll get a kernel inpage error (guess
its time to migrate to 2003.net server-damn)
----- Original Message ----- 
From: "Doug Cutting" <cu...@nutch.org>
To: <nu...@lucene.apache.org>
Sent: Tuesday, August 02, 2005 1:53 PM
Subject: Re: Memory usage


> Try the following settings in your nutch-site.xml:
>
> <property>
>    <name>io.map.index.skip</name>
>    <value>7</value>
> </property>
>
> <property>
>    <name>indexer.termIndexInterval</name>
>    <value>1024</value>
> </property>
>
> The first causes data files to use considerably less memory.
>
> The second affects index creation, so must be done before you create the
> index you search.  It's okay if your segment indexes were created
> without this, you can just (re-)merge indexes and the merged index will
> get the setting and use less memory when searching.
>
> Combining these two I have searched a 40+M page index on a machine using
> about 500MB of RAM.  That said, search times with such a large index are
> not good.  At some point, as your collection grows, you will want to
> merge multiple indexes containing different subsets of segments and put
> each on a separate box and search them with distributed search.
>
> Doug
>
> Jay Pound wrote:
> > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
search
> > using tomcat 5, I plan on having an index with multiple billion pages,
but
> > if this is to scale then even with 16GB of ram I wont be able to have an
> > index larger than 320million pages? how can I distribute the memory
> > requirements across multiple machines, or is there another servlet
program
> > (like resin) that will require less memory to operate, has anyone else
run
> > into this?
> > Thanks,
> > -Jay Pound
> >
> >
>
>



Re: Memory usage2

Posted by Jay Pound <we...@poundwebhosting.com>.
What's the bottleneck for the slow searching? I'm monitoring it and it's
doing about 57% CPU load when I'm searching; it takes about 50 seconds to
bring up the results page the first time, then if I search for the same
thing again it's much faster.
Doug, can I trash my segments after they are indexed? I don't want to have
cached access to the pages; do the segments still need to be there? My
30-million-page index/segment is using over 300 GB. I have the space, but
when I get to the hundreds of millions of pages I will run out of room on
my RAID controllers for HD expansion; I'm planning on moving to Lustre if
NDFS is not stable by then. I plan on having a multi-billion-page index if
the memory requirements for that can be below 16 GB per search node. Right
now I'm getting pretty crappy results from my 30 million pages. I read the
whitepaper on Authoritative Sources in a Hyperlinked Environment because
someone said that's how the Nutch algorithm worked, so I'm assuming that as
my index grows the pages that deserve top placement will receive top
placement, but I don't know if I should re-fetch a new set of segments with
root URLs just ending in US extensions (.com, .edu, etc.). I made a small
set testing this theory (100000 pages) and its results were much better
than my results from the 30-million-page index. What's your thought on
this? Am I right in thinking that the pages with the most pages linking to
them will show up first? So if I index 500 million pages, should my results
be on par with the rest of the "big dogs"?

One last important question: if I merge my indexes, will it search faster
than if I don't merge them? I currently have 20 directories of 1-1.7
million pages each.
And if I split these indexes across multiple machines, will the searching
be faster? I couldn't get the nutch-server to work, but I'm using 0.6.

I have a very fast server; I didn't know if the searching would take
advantage of SMP. Fetching will, and I can run multiple indexing jobs at
the same time. My HD array does 200 MB/sec I/O. I have the new dual-core
Opteron 275 (Italy core) with 4 GB of RAM, working my way to 16 GB and a
second processor when I need them, and 1.28 TB of HD space for Nutch
currently, with expansion up to 5.12 TB. I'm currently running Windows 2000
on it as they haven't made a SUSE 9.3 driver yet for my RAID cards
(HighPoint 2220), so my scalability will be 960 MB/sec with all the drives
in the system and 4x2.2 GHz processor cores. Until I need to cluster,
that's what I have to play with for Nutch.
In case you guys needed to know what hardware I'm running.
Thank you
-Jay Pound
Fromped.com
BTW Windows 2000 is not 100% stable with dual-core processors. Nutch is OK,
but I can't do too many things at once or I'll get a kernel inpage error
(guess it's time to migrate to Windows Server 2003, damn).
----- Original Message ----- 
From: "Doug Cutting" <cu...@nutch.org>
To: <nu...@lucene.apache.org>
Sent: Tuesday, August 02, 2005 1:53 PM
Subject: Re: Memory usage


> Try the following settings in your nutch-site.xml:
>
> <property>
>    <name>io.map.index.skip</name>
>    <value>7</value>
> </property>
>
> <property>
>    <name>indexer.termIndexInterval</name>
>    <value>1024</value>
> </property>
>
> The first causes data files to use considerably less memory.
>
> The second affects index creation, so must be done before you create the
> index you search.  It's okay if your segment indexes were created
> without this, you can just (re-)merge indexes and the merged index will
> get the setting and use less memory when searching.
>
> Combining these two I have searched a 40+M page index on a machine using
> about 500MB of RAM.  That said, search times with such a large index are
> not good.  At some point, as your collection grows, you will want to
> merge multiple indexes containing different subsets of segments and put
> each on a separate box and search them with distributed search.
>
> Doug
>
> Jay Pound wrote:
> > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
search
> > using tomcat 5, I plan on having an index with multiple billion pages,
but
> > if this is to scale then even with 16GB of ram I wont be able to have an
> > index larger than 320million pages? how can I distribute the memory
> > requirements across multiple machines, or is there another servlet
program
> > (like resin) that will require less memory to operate, has anyone else
run
> > into this?
> > Thanks,
> > -Jay Pound
> >
> >
>
>



Re: Memory usage

Posted by Doug Cutting <cu...@nutch.org>.
Try the following settings in your nutch-site.xml:

<property>
   <name>io.map.index.skip</name>
   <value>7</value>
</property>

<property>
   <name>indexer.termIndexInterval</name>
   <value>1024</value>
</property>

The first causes data files to use considerably less memory.

The second affects index creation, so must be done before you create the 
index you search.  It's okay if your segment indexes were created 
without this, you can just (re-)merge indexes and the merged index will 
get the setting and use less memory when searching.

Combining these two I have searched a 40+M page index on a machine using 
about 500MB of RAM.  That said, search times with such a large index are 
not good.  At some point, as your collection grows, you will want to 
merge multiple indexes containing different subsets of segments and put 
each on a separate box and search them with distributed search.

Doug
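
As a rough illustration of why the second setting helps (a sketch only: the
term count below is an assumed figure, and it simply reflects that Lucene
keeps roughly every indexer.termIndexInterval-th term of the .tii term
index in memory):

public class TermIndexEstimate {
    public static void main(String[] args) {
        long totalTerms = 50000000L; // assumed number of unique terms (illustrative)
        int defaultInterval = 128;   // Lucene's default term index interval
        int tunedInterval = 1024;    // the value suggested above

        // Raising the interval shrinks the in-memory term index proportionally,
        // at the cost of slightly slower term lookups.
        System.out.println("terms held in RAM at 128:  " + totalTerms / defaultInterval);
        System.out.println("terms held in RAM at 1024: " + totalTerms / tunedInterval);
    }
}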

Jay Pound wrote:
> I'm testing an index of 30 million pages, it requires 1.5gb of ram to search
> using tomcat 5, I plan on having an index with multiple billion pages, but
> if this is to scale then even with 16GB of ram I wont be able to have an
> index larger than 320million pages? how can I distribute the memory
> requirements across multiple machines, or is there another servlet program
> (like resin) that will require less memory to operate, has anyone else run
> into this?
> Thanks,
> -Jay Pound
> 
> 

Re: Memory usage

Posted by Andy Liu <an...@gmail.com>.
How do you figure that it takes 1.5 GB of RAM for 30M pages?  I believe
that when the Lucene indexes are read, it reads all the numbered *.f*
files and the *.tii files into memory.  The numbered *.f* files
contain the length normalization values for each indexed field (1 byte
per doc), and the .tii file contains every kth term (k=128 by default,
I think).

For 30M documents, each *.f* file is 30 megs, and your .tii file
should be less than 100 megs.  For 8 indexed fields, you'd be looking
at a memory footprint of about 340M.  Any extra memory on the server
can be used for buffer caching which will speed up searches.
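
The arithmetic behind that ~340M figure, as a minimal sketch (the 8-field
count and the 100 MB ceiling for the .tii file are taken from the estimate
above; everything here is approximate):

public class LuceneMemoryEstimate {
    public static void main(String[] args) {
        long docs = 30000000L;   // 30M pages
        int indexedFields = 8;   // assumed number of indexed fields
        // Norm files (*.f*) cost 1 byte per document per indexed field.
        long normsBytes = docs * indexedFields;   // 8 x 30 MB = 240 MB
        long termIndexBytes = 100000000L;         // .tii assumed to stay under ~100 MB
        System.out.println("approximate heap for index data: "
                + (normsBytes + termIndexBytes) / 1000000 + " MB");
    }
}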

If you'd like, you can set up search servers to spread the load across
separate machines.

The servlet container you use shouldn't make much of a difference in
memory usage.

Andy

On 8/2/05, Jay Pound <we...@poundwebhosting.com> wrote:
> I'm testing an index of 30 million pages, it requires 1.5gb of ram to search
> using tomcat 5, I plan on having an index with multiple billion pages, but
> if this is to scale then even with 16GB of ram I wont be able to have an
> index larger than 320million pages? how can I distribute the memory
> requirements across multiple machines, or is there another servlet program
> (like resin) that will require less memory to operate, has anyone else run
> into this?
> Thanks,
> -Jay Pound
> 
> 
>

Memory usage

Posted by Jay Pound <we...@poundwebhosting.com>.
I'm testing an index of 30 million pages, it requires 1.5gb of ram to search
using tomcat 5, I plan on having an index with multiple billion pages, but
if this is to scale then even with 16GB of ram I wont be able to have an
index larger than 320million pages? how can I distribute the memory
requirements across multiple machines, or is there another servlet program
(like resin) that will require less memory to operate, has anyone else run
into this?
Thanks,
-Jay Pound



Re: Detecting CJKV / Asian language pages

Posted by Gavin Thomas Nicol <gt...@rbii.com>.
On Aug 2, 2005, at 6:03 PM, Ken Krugler wrote:
> Thanks for your work in this area! I assume it's RFC 2070 :)

Yes. :-)

> 1. Server doesn't provide any charset info.

Very common in my experience.

> 2. Server provides incorrect charset info.
>     a. Charset is a subset (e.g. 8859-1 vs. 1252)
>     b. Charset is just plain wrong (e.g. 8859-1 vs. 1251)
> 3. Server provides an invalid charset name.
>     a. Charset could be mapped, with a table (e.g. ".UTF8")
>     b. Charset is unknown (e.g. "X-USER-DEFINED").

Yes... I don't know about now, but a few years ago, the data that
*was* sent back from the server was, for the most part, not worth
counting on.



Re: Detecting CJKV / Asian language pages

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:
>
>>Yes - small chunks of untagged text are going to be a problem, no 
>>matter what you do. But if you're referring to query strings from 
>>an HTML page, the default is to use the encoding of the page (which 
>>from Nutch defaults to UTF-8). And you can use the accept-charset 
>>form attribute to explicitly specify UTF-8.
>
>Yes, that's right (FWIW. I'm one of the authors of RFC 2040)...

Thanks for your work in this area! I assume it's RFC 2070 :)

>it'd be interesting to see how well ICU does with crawl data. Does 
>anyone have any experience?

ICU 3.4 was just released, so I don't think there's any real-world 
data yet. Mozilla's charset detector has been around for a while, and 
I haven't heard people complaining loudly about it (other than issues 
with trying to extract it for use in other apps), but I don't monitor 
those mailing lists.

Maybe Otis would be a good person to give this a try, based on his 
email to the list on 7/17/2005.

He also listed a number of charset names that he was getting back 
from servers, many of which weren't valid IANA names. So there are at 
least three kinds of charset problems:

1. Server doesn't provide any charset info.
2. Server provides incorrect charset info.
	a. Charset is a subset (e.g. 8859-1 vs. 1252)
	b. Charset is just plain wrong (e.g. 8859-1 vs. 1251)
3. Server provides an invalid charset name.
	a. Charset could be mapped, with a table (e.g. ".UTF8")
	b. Charset is unknown (e.g. "X-USER-DEFINED").

There are other issues with pages, for example ones that use some kind
of font hack to display "Latin" text in a specialty script - Tibetan
is a good example of this. Typically the page encoding is specified
as 8859-1 or 1252, but when you use the appropriate font it displays
Tibetan glyphs.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
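
For anyone who wants to try it against crawl data, here is a minimal sketch
using the ICU4J CharsetDetector API from ICU 3.4 (the file path argument is
just a stand-in for the raw bytes of a fetched page; this is not part of
Nutch):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetGuess {
    public static void main(String[] args) throws Exception {
        // Raw, undecoded bytes of a fetched page (path passed on the command line).
        byte[] raw = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);

        // detect() returns the best guess; detectAll() would return ranked candidates.
        CharsetMatch best = detector.detect();
        if (best != null) {
            System.out.println("charset=" + best.getName()
                    + " language=" + best.getLanguage()
                    + " confidence=" + best.getConfidence());
        }
    }
}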

Re: Detecting CJKV / Asian language pages

Posted by Gavin Thomas Nicol <gt...@rbii.com>.
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:

> Yes - small chunks of untagged text are going to be a problem, no  
> matter what you do. But if you're referring to query strings from  
> an HTML page, the default is to use the encoding of the page (which  
> from Nutch defaults to UTF-8). And you can use the accept-charset  
> form attribute to explicitly specify UTF-8.

Yes, that's right (FWIW, I'm one of the authors of RFC 2040)... it'd
be interesting to see how well ICU does with crawl data. Does anyone
have any experience?



Re: Detecting CJKV / Asian language pages

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:
>
>>Or you can derive the language from the host URL, if it includes a 
>>country code.
>
>That's not really sufficient... many Japanese sites also have pages 
>in English. Actually, that's true for most non-English sites from 
>what I've seen.

Yes - this is just a last-gasp fallback, in case you're forced to 
guess. Statistically it will be better than always picking en :)

>>>It's hard to detect all the various encodings... EUC-JP, 
>>>SHIFT-JIS, ISO-2022-KR/JP, BIG5, etc. and many servers do not 
>>>correctly identify the encodings.
>>
>>See the latest release of ICU (3.4), which now supports charset detection.
>
>Yes, I forgot about that... but even then I wonder how well it will 
>do. For largish blocks of text (1k or so) it's not bad... you can 
>use statistical modelling to give you accurate probabilities, but 
>for smallish blocks (e.g. query strings) you have a much harder time.

Yes - small chunks of untagged text are going to be a problem, no 
matter what you do. But if you're referring to query strings from an 
HTML page, the default is to use the encoding of the page (which from 
Nutch defaults to UTF-8). And you can use the accept-charset form 
attribute to explicitly specify UTF-8.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: Detecting CJKV / Asian language pages

Posted by Gavin Thomas Nicol <gt...@rbii.com>.
On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:

> Or you can derive the language from the host URL, if it includes a  
> country code.

That's not really sufficient... many Japanese sites also have pages  
in English. Actually, that's true for most non-English sites from  
what I've seen.

>> It's hard to detect all the various encodings... EUC-JP, SHIFT- 
>> JIS, ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly  
>> identify the encodings.
>>
>
> See the latest release of ICU (3.4), which now supports charset  
> detection.

Yes, I forgot about that... but even then I wonder how well it will  
do. For largish blocks of text (1k or so) it's not bad... you can use  
statistical modelling to give you accurate probabilities, but for  
smallish blocks (e.g. query strings) you have a much harder time.


Re: Detecting CJKV / Asian language pages

Posted by Ken Krugler <kk...@transpac.com>.
>On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:
>
>>The current Nutch language identifier plugin currently doesn't handle
>>CJKV pages.  Does anybody here have any experience with automatically
>>detecting the language of such pages?
>>
>>I know there are specific encodings which give away what language the
>>page is, but for Asian language pages that use unicode or its
>>variants, I'm out of luck.
>
>For Unicode it's pretty easy... just look for characters that give 
>away the language... for example, Hiragana for Japanese, Hangul for 
>Korean, etc.

Or you can derive the language from the host URL, if it includes a 
country code.

>It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS, 
>ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly 
>identify the encodings.

See the latest release of ICU (3.4), which now supports charset detection.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: Detecting CJKV / Asian language pages

Posted by Gavin Thomas Nicol <gt...@rbii.com>.
On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:

> The current Nutch language identifier plugin currently doesn't handle
> CJKV pages.  Does anybody here have any experience with automatically
> detecting the language of such pages?
>
> I know there are specific encodings which give away what language the
> page is, but for Asian language pages that use unicode or its
> variants, I'm out of luck.

For Unicode it's pretty easy... just look for characters that give  
away the language... for example, Hiragana for Japanese, Hangul for  
Korean, etc.

It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS,  
ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly identify  
the encodings.
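
A minimal sketch of that Unicode-block approach for already-decoded text
(the language codes returned and the "Han alone means Chinese" fallback are
assumptions for illustration, not what the Nutch plugin does):

public class CjkScriptGuesser {
    public static String guess(String text) {
        int kana = 0, hangul = 0, han = 0;
        for (int i = 0; i < text.length(); i++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(text.charAt(i));
            if (block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA) {
                kana++;        // kana appears only in Japanese
            } else if (block == Character.UnicodeBlock.HANGUL_SYLLABLES) {
                hangul++;      // Hangul appears only in Korean
            } else if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                han++;         // Han ideographs are shared by ja/ko/zh
            }
        }
        if (kana > 0) return "ja";
        if (hangul > 0) return "ko";
        if (han > 0) return "zh";  // Han with no kana or Hangul: assume Chinese
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess("これはテストです"));  // ja
        System.out.println(guess("한국어 페이지"));      // ko
        System.out.println(guess("中文网页"));            // zh
    }
}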