Posted to java-user@lucene.apache.org by Robert Schultz <ro...@cosmicrealms.com> on 2005/08/08 00:17:49 UTC

New Site Live Using Lucene

Not sure if this is appropriate or not, but I just put live a web site 
that I have been working on for over a year, and it uses Lucene for all 
its searching.

I have 46 million documents in 15 Lucene indexes, although the vast 
majority of those consist of only a few words.
The Lucene indexes take up about 6GB of space.

I wrote a Java daemon to listen on a socket, and accept connections from 
my PHP scripts in order to do the searching.
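
A minimal sketch of what such a daemon could look like, not the actual 
code behind the site. The wire protocol (one query line in, one ID per 
line out, then "END"), the class name SearchDaemon, and the loopback-only 
binding are all assumptions made for illustration. The searcher function 
is a stand-in for the real Lucene lookup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

/** Hypothetical line-based search daemon: one query line in, matching IDs out. */
final class SearchDaemon implements Runnable {
    private final ServerSocket server;
    // Stand-in for the real Lucene search: maps a query string to document IDs.
    private final Function<String, List<Long>> searcher;

    SearchDaemon(int port, Function<String, List<Long>> searcher) throws IOException {
        // Assumption: the PHP scripts run on the same host, so bind to loopback only.
        // Port 0 asks the OS for any free port.
        this.server = new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
        this.searcher = searcher;
    }

    int port() {
        return server.getLocalPort();
    }

    /** Serves clients one at a time until close() is called. */
    @Override
    public void run() {
        while (!server.isClosed()) {
            try (Socket client = server.accept()) {
                handle(client);
            } catch (IOException e) {
                // accept() throws once the server socket is closed;
                // a client that fails mid-request is simply dropped.
            }
        }
    }

    private void handle(Socket client) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true);
        String query = in.readLine();
        if (query == null) {
            return; // client disconnected without sending a query
        }
        for (Long id : searcher.apply(query)) {
            out.println(id);
        }
        out.println("END"); // tells the client the result list is complete
    }

    void close() throws IOException {
        server.close();
    }
}
```

A PHP script would then open a socket to the daemon's port, write the 
query followed by a newline, and read IDs until it sees "END". This 
version handles one client at a time; a busy site would likely want a 
thread per connection or a thread pool.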

The results from Lucene include ID numbers that are linked up with MySQL 
records, and together they form the resulting web page.
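
One way the ID lookup could work, as a rough sketch only. The table and 
column names (code_lines, id, body) and the HitJoiner class are made up 
for the example. The point it illustrates: a SQL IN (...) query does not 
return rows in relevance order, so the rows have to be put back into the 
order Lucene ranked them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for joining search-hit IDs with database rows. */
final class HitJoiner {
    /**
     * Builds a parameterized query with one "?" placeholder per hit, e.g.
     * SELECT id, body FROM code_lines WHERE id IN (?, ?, ?)
     * The table name must come from trusted code, never from user input.
     */
    static String selectByIds(String table, int hitCount) {
        if (hitCount < 1) {
            throw new IllegalArgumentException("need at least one id");
        }
        return "SELECT id, body FROM " + table + " WHERE id IN ("
                + String.join(", ", Collections.nCopies(hitCount, "?")) + ")";
    }

    /**
     * Puts fetched rows back into the search engine's ranking order.
     * IDs with no matching database row are skipped.
     */
    static List<String> inHitOrder(List<Long> hitIds, Map<Long, String> rowsById) {
        List<String> ordered = new ArrayList<>();
        for (Long id : hitIds) {
            String row = rowsById.get(id);
            if (row != null) {
                ordered.add(row);
            }
        }
        return ordered;
    }
}
```

The query string would be handed to a JDBC PreparedStatement with the 
IDs bound as parameters, and the result set collected into the map that 
inHitOrder expects.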

You can see the site here: http://csourcesearch.net

It's a website that allows you to search over 99 million lines of open 
source C/C++ code :)

Anyways, just wanted to say thanks a lot for such a great product (even 
if it is java *snicker*)

Thanks again Lucene! :)


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: New Site Live Using Lucene

Posted by Robert Schultz <ro...@cosmicrealms.com>.
Yup, the C/C++ code is parsed using some templates I wrote utilizing 
CodeWorker.
It would be possible to do the same thing to any other language such as 
Java or PHP or Perl.
Although you'd need an expert understanding of that language's syntax in 
order to parse it correctly :)

Initially Lucene was not part of the site.
I was using MySQL to store the data, and used MySQL's FULLTEXT searching.
However, once I reached 25 million+ rows in a single table, MySQL's 
FULLTEXT searching ground to a halt.
After speaking with the MySQL folks, they told me to use Lucene as their 
FULLTEXT support doesn't scale well and Lucene is supposed to be one of 
the best engines around for that.

Since I was already several months into the project with the vast 
majority of the website written to use the MySQL database, converting 
entirely over to Lucene would have meant a complete code re-write.

I didn't want to do that, so I kept MySQL and added Lucene alongside it.

It took over 5 FULL MONTHS of 24/7 100% CPU time to PARSE the C/C++ code 
and insert it into the database.
And I only did 3,200 of the more than 25,000 projects I still need to parse.

In hindsight I might have chosen to house everything in Lucene, however 
it would be a major re-write at this point and I'm happy enough right 
now with my 'merged' approach of PHP, MySQL and Lucene.

Chris Lu wrote:
> This is cool!
> 
> Seems you parsed the C/C++ code. Is this easy to extend to other
> languages, like Java?
> 
> And you chose to display the data stored in database, any reason for
> that compared to reading it from Lucene index itself?
> 
> I feel using Lucene's highlighter may make it easier to read the search results.
> 



Re: New Site Live Using Lucene

Posted by Chris Hostetter <ho...@fucit.org>.
: I feel using Lucene's highlighter may make it easier to read the search
: results.

I'm of the opinion that since the result pages are all source code, syntax
highlighting is definitely the way to go, but given the existing
presentation, it does seem like it would make sense to "highlight" the
lines containing results by emphasising those line numbers ... perhaps by
bolding or changing the color of the line number (since that doesn't affect
the syntax highlighting of the code).  I would also suggest listing the
line number(s) of matches at the top of the page as links to local (named)
anchors (one per line number with a match).
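
A rough sketch of that idea, with every name here (MatchLineRenderer, 
the "L<n>" anchor scheme, the exact markup) invented for illustration 
rather than taken from the site. It bolds the line numbers of matching 
lines, wraps each in a named anchor, and lists the matches as links at 
the top. Syntax highlighting of the code itself is left out; the code 
text is only HTML-escaped.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical renderer: emphasizes matching line numbers and links to them. */
final class MatchLineRenderer {
    static String render(List<String> lines, Set<Integer> matchLines) {
        // Keep only line numbers that exist in the file, in ascending order.
        TreeSet<Integer> matches = new TreeSet<>();
        for (int n : matchLines) {
            if (n >= 1 && n <= lines.size()) {
                matches.add(n);
            }
        }

        StringBuilder html = new StringBuilder();
        // Jump list at the top: one link per matching line.
        html.append("<p>Matches on lines:");
        for (int n : matches) {
            html.append(" <a href=\"#L").append(n).append("\">").append(n).append("</a>");
        }
        html.append("</p>\n<pre>\n");

        for (int i = 0; i < lines.size(); i++) {
            int n = i + 1; // line numbers are 1-based
            if (matches.contains(n)) {
                // Named anchor plus bold number; the code text itself is untouched.
                html.append("<a name=\"L").append(n).append("\"><b>").append(n).append("</b></a>");
            } else {
                html.append(n);
            }
            html.append("  ").append(escape(lines.get(i))).append('\n');
        }
        html.append("</pre>\n");
        return html.toString();
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because only the line number is wrapped, any syntax coloring applied to 
the code text would not be disturbed.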


-Hoss




Re: New Site Live Using Lucene

Posted by Chris Lu <ch...@gmail.com>.
This is cool!

Seems you parsed the C/C++ code. Is this easy to extend to other
languages, like Java?

And you chose to display the data stored in database, any reason for
that compared to reading it from Lucene index itself?

I feel using Lucene's highlighter may make it easier to read the search results.

-- 
Chris Lu
------------
Lucene Search RAD on Any Database
http://www.dbsight.net

