Posted to user@nutch.apache.org by Jimmy Forrester <ji...@gmail.com> on 2005/09/16 23:46:32 UTC

hello nutchers!

Hi, I'm 21 and currently writing my own search engine in PHP & MySQL. I
wanted to build a search engine that only searches fashion, entertainment,
nightlife and gay websites. You can see an example of it running here:

http://onescene.com/search/

As you can see, it isn't branded yet (I'm still looking for a good domain)
and it's tiny. I only ran it for a few hours and it filled 200MB of my
database, so my hosts told me to stop or they would charge me an obscene
amount for going over the 200MB allowance. It's really very basic: just a
full-text search over the non-common words within the page and the meta
data. It kind of works (though it's very inaccurate), but I'm worried that
if I move hosts and keep developing it, it will become too slow to use once
I get 100k web pages in there, even if I optimize the code loads.

I'm worried that I'm not going to be good enough at server config and stuff
to get Nutch running well for me. I've been working for an hour so far and
have only just got Java downloading; I may not even manage to get Tomcat
running at all! Here are my few questions for the community:

   1. How tricky is it to get Nutch running for a server newbie like
   myself?
   2. What's Nutch like for limiting the type of site which gets crawled?
   My current site assesses whether a site is "gay enough" to be added to
   the search domains.
   3. I'm building my search engine as a hobby. Will I need to purchase a
   dedicated server to run Nutch? (I so can't afford that.) Or does anyone
   know a good cheap hosting company which I can definitely get Nutch up
   and running with?
   4. Is my own search engine worth continuing? Or will it simply be too
   slow & inaccurate for people to use?

thank you all for taking the time to read this,

looking forward to any responses!

Jimmy

Re: hello nutchers!

Posted by gekkokid <me...@gekkokid.org.uk>.
Hi,

Can you enlighten me on how one could classify whether a web page is "gay
enough"? Just certain keywords, or is it pages from a particular source?

If search engines are your hobby then it might be worth sticking with your
PHP search engine implementation, just because it will be more fun and you
can tailor it to your requirements more easily (not that Nutch is hard to
customise). Lucene, which Nutch is built on, is also worth looking at for
development projects.

Also, hosting a Nutch engine on anything other than your own machine will be
a pain in the butt. If you're at university they might be able to help you
out :)

_regards
gk




Re: hello nutchers!

Posted by EM <em...@cpuedge.com>.
Jimmy Forrester wrote:

>   1. How tricky is it to get Nutch running for a server newbie like
>   myself?

If you are familiar with Linux and configuration (editors, config
files...), you'll be just fine.

>   2. What's Nutch like for limiting the type of site which gets crawled?

Nutch can be 100% customized; the only thing is that you'll have to do
the customization yourself, sometimes site by site. This will be the most
time-consuming part.

>   3. I'm building my search engine as a hobby - will I need to purchase a
>   dedicated server to run Nutch? (I so can't afford that) or does anyone know
>   a good cheap hosting company which I can definitely get Nutch up and
>   running with?

For an index as small as yours, you don't need a dedicated server. I can
refer you to the person I had a VPS of sorts with ($22/month, but I got
a special offer, so you might pay more); tell me if you are interested.
However, I didn't do the crawling from his server; I just used it as a
place to store the index database (a few gigs, several hours of
uploading, it was worth the wait). Last time I used his server I
uploaded almost a million pages and it worked without a glitch.

For starters, do the crawling from home: you'll get used to Nutch, the
types of errors, configuration things, etc. All this will be a bit more
complicated if you have to do it remotely, and that can be slightly
discouraging.

On a home 1.5 Mbps line (DSL), you can get maybe half a million pages a
day if you tweak your regex enough to skip all kinds of junk. This will
of course depend on what you want to crawl in the first place, the types
of hosts, etc. Also, use a firewall: sometimes a bunch of hosts will try
to ping you back on a bunch of ports as soon as you hit them, and will
keep pinging you on and on. A quick change of IP helps in this case.
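
As a rough sanity check on the half-a-million-pages-a-day figure, here is a
back-of-the-envelope sketch. It assumes 1 Mbps = 1,000,000 bits/s, a line
that stays saturated for 24 hours and no protocol overhead; none of those
assumptions come from the thread, and the function names are made up for
illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(line_mbps):
    """Bytes a line of the given speed can move in one day (assumed ideal)."""
    bits_per_second = line_mbps * 1_000_000
    return bits_per_second / 8 * SECONDS_PER_DAY


def avg_page_budget_bytes(line_mbps, pages_per_day):
    """Average bytes available per page at the given daily page count."""
    return bytes_per_day(line_mbps) / pages_per_day


if __name__ == "__main__":
    total = bytes_per_day(1.5)
    budget = avg_page_budget_bytes(1.5, 500_000)
    print(f"{total / 1e9:.1f} GB per day, {budget / 1000:.1f} KB per page")
```

Under those assumptions the line moves about 16.2 GB a day, so half a
million pages only fits if the average fetched page is around 32 KB or
less, which is why skipping junk with the URL regex matters.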

>   4. Is my own search engine worth continuing? or will it simply be too
>   slow & inaccurate for people to use?

Nutch is a fast and powerful engine; you'll discover that in time.

Re: Exception in mergesegs

Posted by Jérôme Charron <je...@gmail.com>.
Check your plugin configuration; it seems that none is activated:
Registered Plugins:
NONE
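
One quick way to see whether the plugin folder named in the log actually
contains anything is sketched below. It assumes each plugin lives in its own
subdirectory with a plugin.xml manifest; the helper name
find_plugin_manifests is hypothetical and not part of Nutch, and the path is
simply the one printed in the log.

```python
from pathlib import Path


def find_plugin_manifests(plugin_root):
    """Return sorted names of subdirectories that contain a plugin.xml file."""
    root = Path(plugin_root)
    if not root.is_dir():
        # A missing folder gives the same symptom as an empty one.
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "plugin.xml").is_file()
    )


if __name__ == "__main__":
    found = find_plugin_manifests("/nutch/build/plugins")
    print(found if found else "no plugin.xml found under the plugin folder")
```

An empty result would suggest the folder is missing or was never populated,
rather than a problem with the indexing code itself.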

Regards

Jérôme

On 9/17/05, Gal Nitzan <gn...@usa.net> wrote:
> 
> Hi,
> 
> Can someone figure this out:
> 
> 050917 092429 * Merging took 780147 ms
> 050917 092430 * Creating new segment index(es)...
> 050917 092430 * Opening segment 20050917091129
> 050917 092430 * Indexing segment 20050917091129
> 050917 092430 Plugins: looking in: /nutch/build/plugins
> 050917 092430 Plugin Auto-activation mode: [true]
> 050917 092430 Registered Plugins:
> 050917 092430 NONE
> 050917 092430 Registered Extension-Points:
> 050917 092430 NONE
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:144)
> at org.apache.nutch.tools.SegmentMergeTool.run(SegmentMergeTool.java:484)
> at org.apache.nutch.tools.SegmentMergeTool.main(SegmentMergeTool.java:573)
> Caused by: java.lang.RuntimeException: org.apache.nutch.indexer.IndexingFilter not found.
> at org.apache.nutch.indexer.IndexingFilters.<clinit>(IndexingFilters.java:36)
> ... 3 more
> 
> Why is org.apache.nutch.indexer.IndexingFilter not found?
> 
> Thanks,
> 
> Gal
> 



-- 
http://motrech.free.fr/
http://www.frutch.org/

Exception in mergesegs

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

Can someone figure this out:

050917 092429 * Merging took 780147 ms
050917 092430 * Creating new segment index(es)...
050917 092430 * Opening segment 20050917091129
050917 092430 * Indexing segment 20050917091129
050917 092430 Plugins: looking in: /nutch/build/plugins
050917 092430 Plugin Auto-activation mode: [true]
050917 092430 Registered Plugins:
050917 092430   NONE
050917 092430 Registered Extension-Points:
050917 092430   NONE
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:144)
        at org.apache.nutch.tools.SegmentMergeTool.run(SegmentMergeTool.java:484)
        at org.apache.nutch.tools.SegmentMergeTool.main(SegmentMergeTool.java:573)
Caused by: java.lang.RuntimeException: org.apache.nutch.indexer.IndexingFilter not found.
        at org.apache.nutch.indexer.IndexingFilters.<clinit>(IndexingFilters.java:36)
        ... 3 more

Why is org.apache.nutch.indexer.IndexingFilter not found?

Thanks,

Gal

pinging

Posted by EM <em...@cpuedge.com>.
A while ago, whenever I was fetching pages I would start being pinged on a
bunch of ports. The inbound traffic would quickly saturate my line.

Since I put in a hardware firewall this has fully stopped, but I'm still
curious why it was happening.

Has anyone encountered a similar thing?



Re: hello nutchers!

Posted by sub paul <su...@gmail.com>.
If you want to use a cheap host and still use Nutch, you will have to
be creative.

No cheap host will let you run the crawler; it's just too much CPU
power being used by a single host. However, your PC might be strong
enough to handle the task. You might be able to do all of the fetching
and indexing on your PC, move the index to your server, and let Nutch
read that index.

I have not tried that myself, but in theory it should work. Your index
might be a big file, so FTPing it over to your server might be a
little bit of an issue, but you could probably use some automated FTP
client to do the task for you.

So your Nutch server will just look at the index and search from it
(very fast).
Your home PC will build the index and FTP it over.
Once your index is on the server, you should restart Nutch, and Nutch
will then be up to date with the new websites.
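
The "build at home, FTP it over" step could be automated along these lines.
This is only a sketch using Python's standard ftplib: the host, credentials
and directory names are placeholders, it assumes the index is a flat
directory of files, and restarting Nutch on the server afterwards is left
out.

```python
import ftplib
from pathlib import Path


def list_index_files(index_dir):
    """Return the regular files directly inside index_dir, sorted by path."""
    return sorted(p for p in Path(index_dir).iterdir() if p.is_file())


def upload_index(ftp, index_dir, remote_dir):
    """Upload every file in index_dir into remote_dir over an open session.

    `ftp` can be any object with ftplib.FTP's cwd() and storbinary() methods.
    Returns the names of the uploaded files in upload order.
    """
    ftp.cwd(remote_dir)
    uploaded = []
    for path in list_index_files(index_dir):
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded


def push_index(host, user, password, index_dir, remote_dir):
    """Open an FTP session, upload the index and close the session."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        return upload_index(ftp, index_dir, remote_dir)


# Example call with placeholder values (not run here):
# push_index("ftp.example.com", "user", "secret", "crawl/index", "/nutch/index")
```

Keeping upload_index separate from the connection code makes it easy to try
out against a stand-in object before pointing it at a real server.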

You can probably update the index as often as you can finish creating one.

If you want me to suggest a cheap JSP host, just let me know.

Regards,
Paul



Re: hello nutchers!

Posted by Michael Ji <fj...@yahoo.com>.
For development and testing purposes, a reasonably powerful PC is more
than enough (I used a Dell P4, and Nutch runs well on it), but you
definitely need a high-speed internet connection for everything Nutch
requires.

Michael Ji,




		

RE: hello nutchers!

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Jimmy,


>    1. How tricky is it to get Nutch running for a server newbie like
>    myself?

Hmmm...  It can be somewhat tricky, especially if you encounter
problems.  I guess you'll have to look closely at the Tomcat
documentation first and then at the Nutch documentation to grasp the
concepts involved.

>    2. What's Nutch like for limiting the type of site which gets
>    crawled? my current site assesses if the site is "gay enough" to be
>    added to the search domains

You can't really say to Nutch: "Go and get gay stuff"; that would
require it to be somewhat more intelligent than it is.  You can filter
the URLs, and that's about as close as you can get to fetching only the
interesting pages.
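
To make "filter the URLs" concrete, here is a small sketch of rule-based
URL filtering. It is modelled loosely on the +/- regex rules of Nutch's
regex URL filter, but it is not Nutch code: the first-match-wins behaviour
and the reject-by-default fallback are stated as assumptions, and the host
names in the rules are invented placeholders.

```python
import re

# Hypothetical rules: '+' accepts, '-' rejects, the first matching rule wins.
RULES = [
    r"-\.(gif|jpg|png|css|js|zip|exe)$",            # skip images and binaries
    r"-[?*!@=]",                                    # skip query-style URLs
    r"+^http://(www\.)?fashion\.example\.com/",     # placeholder topical site
    r"+^http://(www\.)?nightlife\.example\.org/",   # placeholder topical site
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (allowed, compiled_pattern) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accept(url, rules):
    """Return True if the first rule matching url is a '+' rule."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False  # assumption: URLs matching no rule are rejected
```

With rules like these the crawl stays on a hand-picked set of hosts; deciding
whether a page's content is on topic would need a separate step after
fetching.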

>    3. I'm building my search engine as a hobby - will I need to
>    purchase a dedicated server to run Nutch? (I so can't afford that)
>    or does anyone know a good cheap hosting company which I can
>    definitely get Nutch up and running with?

I don't know of any free/cheap web hosting company that supports Java
and provides the space you need for your project.



>    4. Is my own search engine worth continuing? or will it simply be
>    too slow & inaccurate for people to use?

I guess it all depends on how far you want to (or can) get into the
development.  However, as a Nutch enthusiast, I can only praise Nutch
and recommend its use...

I hope this helps...


Regards,
Sebastien.



	

	
		