Posted to java-user@lucene.apache.org by luocanrao <lu...@sohu.com> on 2010/06/09 16:18:17 UTC

A question about Google's search index?

Here is a news item about Google's search index. Lucene's indexing system can
also support real-time search.

Is there any difference between the two?

 

With Caffeine, we analyze the web in small portions and update our search
index on a continuous basis, globally. As we find new pages, or new
information on existing pages, we can add these straight to the index. That
means you can find fresher information than ever before, no matter when or
where it was published.
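
The incremental updating described above can be illustrated with a toy sketch. This is not Google's implementation (which is not public) and it does not use Lucene's API; the class ToySegmentedIndex and its methods add, refresh and search are hypothetical names invented here. It assumes a design in which newly added documents are buffered, then written out as a small immutable segment that searches consult alongside the older segments, so the existing index is never rebuilt. That loosely resembles how Lucene flushes new segments and reopens a reader for near-real-time search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy index (hypothetical): added documents become searchable after refresh(). */
class ToySegmentedIndex {
    // Each segment maps a term to the ids of the documents that contain it.
    private final List<Map<String, List<Integer>>> segments = new ArrayList<>();
    // Documents added since the last refresh; not yet visible to search().
    private final List<String> buffered = new ArrayList<>();
    private int nextDocId = 0;

    public synchronized void add(String text) {
        buffered.add(text);
    }

    /** Turns the buffered documents into one new segment; old segments are untouched. */
    public synchronized void refresh() {
        if (buffered.isEmpty()) {
            return;
        }
        Map<String, List<Integer>> segment = new HashMap<>();
        for (String text : buffered) {
            int docId = nextDocId++;
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = segment.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        segments.add(segment);
        buffered.clear();
    }

    /** Collects matching document ids from every segment, oldest segment first. */
    public synchronized List<Integer> search(String term) {
        String key = term.toLowerCase();
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(key, Collections.emptyList()));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.add("Caffeine updates the index continuously");
        System.out.println(index.search("caffeine")); // [] (not refreshed yet)
        index.refresh();
        System.out.println(index.search("caffeine")); // [0]
    }
}
```

The trade-off in this sketch is that each refresh is cheap, but searches must visit every segment; a real system would also merge small segments in the background, which is not shown here.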

 

Caffeine lets us index web pages on an enormous scale. In fact, every second
Caffeine processes hundreds of thousands of pages in parallel. If this were
a pile of paper it would grow three miles taller every second. Caffeine
takes up nearly 100 million gigabytes of storage in one database and adds
new information at a rate of hundreds of thousands of gigabytes per day. You
would need 625,000 of the largest iPods to store that much information; if
these were stacked end-to-end they would go for more than 40 miles.
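
As a quick sanity check on the figures quoted above, the arithmetic can be worked out directly. The sketch below takes the numbers from the quote (100 million gigabytes, 625,000 devices, 40 miles) and derives what each device would have to be; the class and method names are made up for illustration, and the only outside assumption is the standard conversion of 1609.344 meters per mile.

```java
/** Back-of-the-envelope check of the storage comparison (hypothetical helper). */
class CaffeineArithmetic {
    static final double METERS_PER_MILE = 1609.344;

    /** Capacity each device needs so that `devices` of them hold `totalGigabytes`. */
    static double gigabytesPerDevice(double totalGigabytes, long devices) {
        return totalGigabytes / devices;
    }

    /** Length of one device in centimeters if `devices` laid end to end span `miles`. */
    static double centimetersPerDevice(double miles, long devices) {
        return miles * METERS_PER_MILE * 100.0 / devices;
    }

    public static void main(String[] args) {
        System.out.println(gigabytesPerDevice(100_000_000, 625_000)); // 160.0
        System.out.println(centimetersPerDevice(40, 625_000));
    }
}
```

So the comparison implies a 160 GB device roughly 10.3 cm long, and the two figures in the quote are consistent with each other.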

Re: A question about Google's search index?

Posted by Lance Norskog <go...@gmail.com>.

http://research.google.com/pubs/DistributedSystemsandParallelComputing.html



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: A question about Google's search index?

Posted by Yuval Feinstein <yu...@answers.com>.

Most of the implementation of Google's search index is kept secret by Google.
Based on publicly available information, the two indexes are quite different:
Google uses its BigTable and MapReduce technologies to distribute the index efficiently.
There are similar efforts in the Lucene ecosystem; Solr Cloud, which is currently in development, is an advanced one.
As Google's scoring algorithm uses hundreds of signals, I guess they store data pertinent to these signals in the index.
Lucene's index holds relatively few pieces of information about each document (posting lists, term vectors,
and sometimes norms and payloads).
I believe there are other differences as well,
but one can only guess what they are...
Cheers,
Yuval
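
To make the per-document data mentioned above more concrete, here is a minimal sketch of posting lists. It is not Lucene's on-disk format or API; ToyPostings, Posting and the method names are hypothetical. It assumes a posting holds a document id plus the positions of the term in that document, and it uses the plain token count of each document as a stand-in for a norm (Lucene's real norms encode length normalization differently). Term vectors and payloads are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy posting lists (hypothetical): term -> postings with positions. */
class ToyPostings {
    /** One posting: a document id and the positions where the term occurs in it. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        int termFrequency() {
            return positions.size();
        }
    }

    // Terms kept in sorted order, each with its postings in increasing docId order.
    private final Map<String, List<Posting>> postings = new TreeMap<>();
    // Token count per document; a simplified stand-in for a norm.
    private final List<Integer> docLengths = new ArrayList<>();

    /** Tokenizes the text, records each term occurrence, and returns the new doc id. */
    int addDocument(String text) {
        int docId = docLengths.size();
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Posting> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Start a new posting the first time this document mentions the term.
            if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
                list.add(new Posting(docId));
            }
            list.get(list.size() - 1).positions.add(position++);
        }
        docLengths.add(position);
        return docId;
    }

    List<Posting> postingsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    int docLength(int docId) {
        return docLengths.get(docId);
    }
}
```

For example, after adding "the quick brown fox" and then "the lazy dog the end", postingsFor("the") holds two postings, and the second one records positions 0 and 3 in a document of length 5.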

