Posted to user@nutch.apache.org by Paul Jones <pa...@yahoo.co.uk> on 2009/06/22 01:17:21 UTC

adding pre-indexed DB's together

Hi

I'm a newbie to the world of Lucene, Nutch, and Mahout. I spent all weekend on Mahout and am now looking at Nutch. So I have a question: it seems (after reading the archives) that a lot of people are using Nutch to index the web, whether for vertical searches or just the web as a whole. Rather than everyone starting again from scratch, and since very little (if any) "IP" would exist in the indexes, since nothing clever has been done to them except being processed by Nutch, would it not be possible to "share" all these indexes with each other? For example, someone may have built an index of all blogs, or of all car-related websites, or just indexed 100 million web pages at random. Maybe there is some technical reason I am missing.

Paul



      

Re: adding pre-indexed DB's together

Posted by Otis Gospodnetic <og...@yahoo.com>.
I wonder whether seeding it via BitTorrent (not sure if I'm using the right terminology here) would be helpful.  Probably not, because there would be too few people interested in it, running torrent clients, and sharing it, but who knows...  I think those AOL search logs were once available via BitTorrent.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Dennis Kubes <ku...@apache.org>
> To: nutch-user@lucene.apache.org
> Sent: Monday, June 22, 2009 2:45:10 PM
> Subject: Re: adding pre-indexed DB's together
> 
> There is still the URL crawl db, which had over 1 billion URLs at last count.  So
> it might be a good starting point for crawling the web.  At last count, though, it
> was 250G in size, so not downloadable unless you have a fast connection.  It is
> available for anyone who wants it, though.
> 
> Dennis
> 


Re: adding pre-indexed DB's together

Posted by Paul Jones <pa...@yahoo.co.uk>.
Thanks, Dennis.

Are there any further details of this DB?

Paul




________________________________
From: Dennis Kubes <ku...@apache.org>
To: nutch-user@lucene.apache.org
Sent: Monday, 22 June, 2009 19:45:10
Subject: Re: adding pre-indexed DB's together

There is still the URL crawl db, which had over 1 billion URLs at last count.  So it might be a good starting point for crawling the web.  At last count, though, it was 250G in size, so not downloadable unless you have a fast connection.  It is available for anyone who wants it, though.

Dennis




      

Re: adding pre-indexed DB's together

Posted by Dennis Kubes <ku...@apache.org>.
There is still the URL crawl db, which had over 1 billion URLs at last
count.  So it might be a good starting point for crawling the web.  At
last count, though, it was 250G in size, so not downloadable unless you have
a fast connection.  It is available for anyone who wants it, though.

Dennis


Re: adding pre-indexed DB's together

Posted by Otis Gospodnetic <og...@yahoo.com>.
Paul,

There was talk of this in the past, at least between some other people here and me, possibly "off-line".  Your best bet may be going to what's left of Wikia Search and getting their old index.  But, you see, this is exactly the problem - the index may be quite outdated by now.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: adding pre-indexed DB's together

Posted by MilleBii <mi...@gmail.com>.
I think you underestimate the potential applications of Nutch, because there
can be quite a lot of intelligence ("IP") in the plug-in architecture.
You can choose to focus your crawl and/or content, you can choose to add
specific fields that fit a vertical search field, and you may want to adapt
the scoring of URLs and content.

So the limitations are not technical but related to the fact that search
applications are different and therefore the data will be different.





-- 
-MilleBii-