Posted to java-user@lucene.apache.org by Jo...@platts.com on 2003/06/04 16:08:49 UTC

commercial websites powered by Lucene?


Hello All,

I've been trying to find examples of large commercial websites that
use Lucene to power their search.  Having such examples would
make Lucene an easy sell to management.

Does anyone know of any good examples?  The bigger the better, and
the more the better.

TIA,
-John



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: commercial websites powered by Lucene?

Posted by Che Dong <ch...@hotmail.com>.
http://search.163.com  China portal NetEase uses Lucene for directory search and news search.


Che, Dong
http://www.chedong.com

----- Original Message ----- 
From: <Jo...@platts.com>
To: <lu...@jakarta.apache.org>
Sent: Wednesday, June 04, 2003 10:08 PM
Subject: commercial websites powered by Lucene?


> 
> 
> Hello All,
> 
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would
> make Lucene an easy sell to management.
> 
> Does anyone know of any good examples?  The bigger the better, and
> the more the better.
> 
> TIA,
> -John
> 
> 
> 
> 
> 

RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
About 100 documents every twenty minutes, but it fluctuates depending on
how much traffic is on the site.

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 3:28 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Hmm, good point about the cost of copying indices in a distributed
environment, although that is unlikely to affect us in the foreseeable
future. But, noted!

Do you have any rough statistics on how many documents you index/day, or
how many every 20 minutes?

This discussion is fantastic by the way, lots of great experience and
comments coming out here. Thanks, it's really appreciated.

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:002401c33a42$6a350ce0$1801a8c0@naderit...
> We thought of that in the beginning and then we became more
> comfortable with multiple indices for simple backup purposes. Now
> our indices are in excess of 100 MB, and transferring that kind of
> data between three machines sitting in the same data center is
> manageable, but once you start thinking of distributed webservers in
> different hosting facilities, copying 100 MB every 20 minutes, or
> even every hour, becomes financially expensive.
>
> Our webservers are single-processor Sun UltraSPARC III 400 MHz
> machines with two gigs of memory, and I've never seen the CPU usage
> go over 0.8 at peak time with the indexer running. Try it out first,
> take your time to gather your own numbers so you can really get a
> feel for what setup fits you best.
>
> Nader






Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
Hmm, good point about the cost of copying indices in a distributed
environment, although that is unlikely to affect us in the foreseeable
future. But, noted!

Do you have any rough statistics on how many documents you index/day, or how
many every 20 minutes?

This discussion is fantastic by the way, lots of great experience and
comments coming out here. Thanks, it's really appreciated.

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:002401c33a42$6a350ce0$1801a8c0@naderit...
> We thought of that in the beginning and then we became more comfortable
> with multiple indices for simple backup purposes. Now our indices
> are in excess of 100 MB, and transferring that kind of data between
> three machines sitting in the same data center is manageable, but once
> you start thinking of distributed webservers in different hosting
> facilities, copying 100 MB every 20 minutes, or even every hour,
> becomes financially expensive.
>
> Our webservers are single-processor Sun UltraSPARC III 400 MHz machines
> with two gigs of memory, and I've never seen the CPU usage go over 0.8
> at peak time with the indexer running. Try it out first, take your time
> to gather your own numbers so you can really get a feel for what setup
> fits you best.
>
> Nader






Re: commercial websites powered by Lucene?

Posted by David Medinets <me...@mtolive.com>.
----- Original Message -----
From: "Chris Miller" <ch...@hotmail.com>
> Thanks David, that's about what I figured. Of course if the servers are
> pulling the information then a central holding table that contains only
> new data doesn't make much sense anymore. Instead I guess the easiest
> approach would be to have a central table that contains the entire dataset

The following commentary may have no bearing on Lucene, or relevance to
today's technology, but I feel garrulous this morning.

Each pulling server did a three-step dance when updating. First, the central
server (Oracle) was polled to get the latest data (actually we sucked it all
because there were only 30,000 records). A text file was created (format is
unimportant, use the easiest for your application). Then that text file was
read to update the local datastore.

The advantage of this rigamarole was to allow the servers to fail and be
restored without needing to poll the central server. We had 400 servers in
the cluster, and at times many of them would fail (this was in 1999, don't
be critical!). If many systems pulled data from the central server, the
process would slow down, which started another round of failures. To avoid
that vicious circle of failure, all of the systems could reboot
independently.
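The three-step dance described above can be sketched like this (a stdlib-only illustration; the polling step, the file format, and the record layout are all placeholders, not the actual code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PullUpdater {
    // Step 1: poll the central server (stubbed here as a fixed list;
    // the real thing pulled ~30,000 records from Oracle).
    static List<String> pollCentralServer() {
        return List.of("record-1|title one", "record-2|title two");
    }

    // Step 2: snapshot the pulled data to a local text file, so a
    // crashed server can rebuild without re-hitting the central server.
    static void writeSnapshot(Path file, List<String> records) throws IOException {
        Files.write(file, records, StandardCharsets.UTF_8);
    }

    // Step 3: read the snapshot back to (re)build the local datastore.
    static List<String> loadSnapshot(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path snapshot = Files.createTempFile("central-data", ".txt");
        writeSnapshot(snapshot, pollCentralServer());
        System.out.println(loadSnapshot(snapshot).size() + " records restored");
    }
}
```

The local snapshot file is what lets the servers restart independently: a reboot replays step 3 only, instead of stampeding the central server.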

David Medinets
http://www.codebits.com






RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
We thought of that in the beginning and then we became more comfortable
with multiple indices for simple backup purposes. Now our indices
are in excess of 100 MB, and transferring that kind of data between
three machines sitting in the same data center is manageable, but once you
start thinking of distributed webservers in different hosting
facilities, copying 100 MB every 20 minutes, or even every hour,
becomes financially expensive.

Our webservers are single-processor Sun UltraSPARC III 400 MHz machines with
two gigs of memory, and I've never seen the CPU usage go over 0.8 at
peak time with the indexer running. Try it out first, take your time to
gather your own numbers so you can really get a feel for what setup
fits you best.

Nader



-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 2:58 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Thanks David, that's about what I figured. Of course if the servers are
pulling the information then a central holding table that contains only
new data doesn't make much sense anymore. Instead I guess the easiest
approach would be to have a central table that contains the entire
dataset, and has last-modified timestamps on each record so the
individual webservers can grab just the data that was changed since they
last ran an index update. My concern still is that the effort of
indexing (which is potentially quite
high) is being duplicated across all the webservers.

Is there any reason why it would be a bad idea to have one machine
responsible for grabbing updates and adding documents to a master index,
so the other servers could periodically grab a copy of that index and
hot-swap it with their previous copy? Is Lucene capable of handling that
scenario? Seems to me that this approach would reduce the stress on the
webservers even more, and even if the indexing server went down the
webservers would still have a stale index to search against. Has anyone
attempted something like this?


"David Medinets" <me...@mtolive.com> wrote in message
news:059601c33a3d$423547f0$6722a8c0@medined01...
> ----- Original Message -----
> From: "Chris Miller" <ch...@hotmail.com>
> > Did you look at having just a single process that was responsible
> > for updating the index, and then pushing copies out to all the
> > webservers? I'm wondering if that might be worth investigating
> > (since it would take a lot of load off the webservers that are
> > running the searches), or if it will be too troublesome in practice.
>
> I've found that pulling information from a central source is simpler
> than pushing information. When information is pushed, there is much
> administration on the central server to track the recipient machines:
> servers are added to and dropped from the push list, and you need to
> account for servers that stop responding. When information is pulled
> from the central source, these issues of coordination are eliminated.
>
> David Medinets
> http://www.codebits.com






Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
Thanks David, that's about what I figured. Of course if the servers are
pulling the information then a central holding table that contains only new
data doesn't make much sense anymore. Instead I guess the easiest approach
would be to have a central table that contains the entire dataset, and has
last-modified timestamps on each record so the individual webservers can
grab just the data that was changed since they last ran an index update. My
concern still is that the effort of indexing (which is potentially quite
high) is being duplicated across all the webservers.
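The last-modified scheme sketched above might look like this in miniature (a hypothetical illustration; the Row shape stands in for the real table, and in SQL the filter is simply WHERE last_modified > ?):

```java
import java.util.List;
import java.util.stream.Collectors;

public class IncrementalPull {
    // One row of the central table; last_modified drives the delta pull.
    record Row(String id, String body, long lastModified) {}

    // Each webserver remembers the timestamp of its last index run and
    // pulls only the rows changed since then.
    static List<Row> changedSince(List<Row> table, long since) {
        return table.stream()
                .filter(r -> r.lastModified() > since)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> table = List.of(new Row("42", "new description", 1700L));
        System.out.println(changedSince(table, 0L).size() + " changed row(s)");
    }
}
```

Because every webserver tracks its own high-water mark, nothing in the central table ever needs to be "cleaned out", which sidesteps the holding-table coordination problem.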

Is there any reason why it would be a bad idea to have one machine
responsible for grabbing updates and adding documents to a master index, so
the other servers could periodically grab a copy of that index and hot-swap
it with their previous copy? Is Lucene capable of handling that scenario?
Seems to me that this approach would reduce the stress on the webservers even
more, and even if the indexing server went down the webservers would still
have a stale index to search against. Has anyone attempted something like
this?
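The hot-swap idea works in principle: build or copy the new index off to the side, then switch searchers over in one atomic step. A minimal stdlib sketch of the swap itself (a plain Map stands in for the index; a real deployment would swap a Lucene searcher over a freshly copied index directory):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class HotSwapIndex {
    // Searchers read through this reference; the updater swaps in a
    // complete fresh copy atomically, so readers never observe a
    // half-written index. A stale copy keeps serving until the swap.
    private final AtomicReference<Map<String, String>> live =
            new AtomicReference<>(Map.of());

    String search(String key) {
        return live.get().get(key);
    }

    void swapIn(Map<String, String> freshCopy) {
        live.set(freshCopy);  // in-flight searches finish on the old copy
    }

    public static void main(String[] args) {
        HotSwapIndex idx = new HotSwapIndex();
        idx.swapIn(Map.of("lucene", "search library"));
        System.out.println(idx.search("lucene"));
    }
}
```

The same property the poster wants falls out naturally: if the indexing machine dies, the reference still points at the last good copy, so searches degrade to stale rather than failing.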


"David Medinets" <me...@mtolive.com> wrote in message
news:059601c33a3d$423547f0$6722a8c0@medined01...
> ----- Original Message -----
> From: "Chris Miller" <ch...@hotmail.com>
> > Did you look at having just a single process that was responsible for
> > updating the index, and then pushing copies out to all the webservers?
> > I'm wondering if that might be worth investigating (since it would take
> > a lot of load off the webservers that are running the searches), or if
> > it will be too troublesome in practice.
>
> I've found that pulling information from a central source is simpler than
> pushing information. When information is pushed, there is much
> administration on the central server to track the recipient machines:
> servers are added to and dropped from the push list, and you need to
> account for servers that stop responding. When information is pulled
> from the central source, these issues of coordination are eliminated.
>
> David Medinets
> http://www.codebits.com






Re: commercial websites powered by Lucene?

Posted by David Medinets <me...@mtolive.com>.
----- Original Message -----
From: "Chris Miller" <ch...@hotmail.com>
> Did you look at having just a single process that was responsible for
> updating the index, and then pushing copies out to all the webservers? I'm
> wondering if that might be worth investigating (since it would take a lot
> of load off the webservers that are running the searches), or if it will
> be too troublesome in practice.

I've found that pulling information from a central source is simpler than
pushing information. When information is pushed, there is much
administration on the central server to track the recipient machines:
servers are added to and dropped from the push list, and you need to
account for servers that stop responding. When information is pulled from
the central source, these issues of coordination are eliminated.

David Medinets
http://www.codebits.com





RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
I have to store the information I am indexing in the database because
the nature of our application requires it. On update of certain columns
in a table I create an XML file, which is then copied to directories on
each of my web servers; separate Lucene apps, running on separate
machines, digest the information into separate indices. You also have to
provide procedures that run periodically to ensure that all your
indices are in sync with each other and in sync with the DB (I run this
once every three days when the CPU usage on the machines is low).

To update the index I have a servlet running off a scheduler in Resin
(you could use any webserver, Orion's cool too). The up-side to
distributing your search engines like this is that you have three active
backups in case one gets corrupted (hasn't happened in two years), and
the load on each machine is pretty low even during updates/optimizations
every 20 minutes.

If the server crashes, it's not a problem unless it happens
mid-indexing; then you have to somehow remove the write locks created in
the index directory (I just delete them, optimize, and re-start the
update that crashed).

Lucene destroyed Oracle on speed tests. We used to have to use our
single DB monster machine for all the searching and indexing, which made
the load on it pretty high, but now I have 0.5 loads on all my CPUs and
no need to buy new hardware.
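The crash-recovery step described above (delete the stale write locks, then re-run the update) can be sketched like this. The lock file name is an assumption for illustration; check what your Lucene version actually writes into the index directory, and only do this when you are certain no indexer is still running:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LockCleaner {
    // After a crash mid-indexing, a leftover lock file can block the
    // next scheduled update run. Deleting it lets the update restart.
    static boolean clearStaleLock(Path indexDir, String lockName) throws IOException {
        return Files.deleteIfExists(indexDir.resolve(lockName));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        Files.createFile(dir.resolve("write.lock"));  // simulate a stale lock
        System.out.println("removed: " + clearStaleLock(dir, "write.lock"));
    }
}
```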

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 1:12 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


So you have a holding table in a database (or directory on disk?) where
you store the incoming documents, correct? Does each webserver run its
own indexing thread which grabs any new documents every 20 minutes, or
is there a central process that manages that? I'm trying to understand
how you know when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for
updating the index, and then pushing copies out to all the webservers?
I'm wondering if that might be worth investigating (since it would take
a lot of load off the webservers that are running the searches), or if
it will be too troublesome in practice.

Also, I'm interested to see how you handle the situation when a server
gets shutdown/restarted - does it just take a copy of the index from one
of the other servers (since its own index is likely out of date)? I
take it it's not safe to copy an index while it is being updated, so you
have to block on that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got
some skeptical management that need some convincing, hearing stories
like this helps a lot :-)

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:000b01c33a2a$d2675290$1801a8c0@naderit...
> I handle updates and inserts the same way: first I delete the document
> from the index and then I insert it (better safe than sorry). I batch
> my updates/inserts every twenty minutes; I would do it in smaller
> intervals, but since I have to sync the XML files created from the DB
> to three machines (I maintain three separate Lucene indices on my
> three separate web-servers) it takes a little longer. You have to
> batch your changes because updating the index takes time, as opposed
> to deletes, which I batch every two minutes. You won't have a problem
> updating the index and searching at the same time, because Lucene
> updates the index on a separate set of files and then, when it's done,
> overwrites the old version. I've had to provide for backups, and
> things like server crashes mid-indexing, but I was using Oracle
> interMedia before and Lucene BLOWS IT AWAY.






Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
So you have a holding table in a database (or directory on disk?) where you
store the incoming documents, correct? Does each webserver run its own
indexing thread which grabs any new documents every 20 minutes, or is there
a central process that manages that? I'm trying to understand how you know
when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for
updating the index, and then pushing copies out to all the webservers? I'm
wondering if that might be worth investigating (since it would take a lot of
load off the webservers that are running the searches), or if it will be too
troublesome in practice.

Also, I'm interested to see how you handle the situation when a server gets
shutdown/restarted - does it just take a copy of the index from one of the
other servers (since its own index is likely out of date)? I take it it's
not safe to copy an index while it is being updated, so you have to block on
that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got some
skeptical management that need some convincing, hearing stories like this
helps a lot :-)

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:000b01c33a2a$d2675290$1801a8c0@naderit...
> I handle updates and inserts the same way: first I delete the document
> from the index and then I insert it (better safe than sorry). I batch my
> updates/inserts every twenty minutes; I would do it in smaller intervals,
> but since I have to sync the XML files created from the DB to three
> machines (I maintain three separate Lucene indices on my three separate
> web-servers) it takes a little longer. You have to batch your changes
> because updating the index takes time, as opposed to deletes, which I
> batch every two minutes. You won't have a problem updating the index and
> searching at the same time, because Lucene updates the index on a
> separate set of files and then, when it's done, overwrites the old
> version. I've had to provide for backups, and things like server crashes
> mid-indexing, but I was using Oracle interMedia before and Lucene BLOWS
> IT AWAY.
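The delete-then-insert idiom quoted above was the standard way to "update" in Lucene of this era, which had no single update call: remove any existing document by its key, then add the new version. A toy illustration of the pattern (a Map stands in for the index; in Lucene itself, deletes went through IndexReader and adds through IndexWriter):

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteThenAdd {
    // "Better safe than sorry": treat every incoming record as an
    // update. Deleting first guarantees at most one live copy per key,
    // whether or not the document was already indexed.
    private final Map<String, String> index = new HashMap<>();

    void upsert(String id, String body) {
        index.remove(id);  // no-op if the document was never indexed
        index.put(id, body);
    }

    int size() { return index.size(); }

    public static void main(String[] args) {
        DeleteThenAdd idx = new DeleteThenAdd();
        idx.upsert("doc1", "first version");
        idx.upsert("doc1", "second version");
        System.out.println(idx.size() + " document(s) indexed");
    }
}
```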






RE: commercial websites powered by Lucene?

Posted by John Takacs <ta...@highway61.com>.
Ulrich, Vince,

I think a big, "I'm a dummy" post may be in order.  ;-)

I'll do as you suggested immediately.

Regards,

John

-----Original Message-----
From: news [mailto:news@main.gmane.org]On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 1:30 AM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


John Takacs wrote:
> Good idea.  I was just following the install directions, but if I don't
> have to pay attention to the install directions, I'll find a much better
> one.
>
> Any hints?  Previous email discussion maybe?  I found some references via
> searching the archives, but I'm not 100% convinced they are applicable to
> my situation.
> situation.

I'm not sure what you mean by install directions; Lucene is just a JAR
file and you use it like any other Java class library. There's also the
WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary
distribution and used it.

Ulrich





Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
John Takacs wrote:
> Good idea.  I was just following the install directions, but if I don't have
> to pay attention to the install directions, I'll find a much better one.
> 
> Any hints?  Previous email discussion maybe?  I found some references via
> searching the archives, but I'm not 100% convinced they are applicable to my
> situation.

I'm not sure what you mean by install directions; Lucene is just a JAR
file and you use it like any other Java class library. There's also the 
WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary 
distribution and used it.

Ulrich





Re: commercial websites powered by Lucene?

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote:
> John Takacs wrote:
> > I'd love to try Lucene with the above, but the Lucene install fails
> > because of JavaCC issues.  Surprised more people haven't encountered this
> > problem, as the install instructions are out of date.
>
> Well, what do you need JavaCC for? Isn't it just the technology for
> building the supplied HTML-Parser? There are much better HTML parsers
> out there, which you can use.

On a related note: has anyone done performance measurements of various
HTML parsers used for indexing?

I have written a couple of XML/HTML parsers that were optimized for speed
(and/or leniency, to be able to handle/fix non-valid documents), and was
wondering if they might be useful for indexing purposes for other people (one
is in general pretty optimal if document contents are fully in memory
already, like when fetching from a DB; another uses very little memory, while
being only slightly slower). However, using those as opposed to more standard
ones would only make sense if there are significant speed improvements.
And to do that, it would be good to have baseline measurements, and/or to
know what the current best candidates are, from a performance perspective.

The thing is that creating a parser that only cares about textual content (and
perhaps in some cases about the surrounding element, but not about attributes,
structure, DTD/Schema, validity, etc.) is fairly easy, and since indexing is
often the most CPU-intensive part of a search engine, it may make sense to
optimize this part heavily, up to and including using specialized parsers.
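A content-only parser of the kind described above really is small. Here is a deliberately naive sketch: it keeps text and drops everything between < and >, ignoring entities, script bodies, and validity entirely (real indexing code would handle those too):

```java
public class TextOnlyHtmlParser {
    // Extracts the text content of an HTML fragment for indexing.
    // No attributes, no structure, no validation: just the words.
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) out.append(c);
        }
        // Collapse the whitespace left behind by removed tags.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```

A single-pass scanner like this allocates almost nothing beyond the output buffer, which is exactly why a specialized extractor can beat a general-purpose parser on indexing throughput.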

-+ Tatu +-




Re: commercial websites powered by Lucene?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> Well, what do you need JavaCC for? Isn't it just the technology for 
> building the supplied HTML-Parser? There are much better HTML parsers
> out there, which you can use.

Its primary use in the Lucene package is for parsing users' queries.

Otis





RE: commercial websites powered by Lucene?

Posted by John Takacs <ta...@highway61.com>.
Good idea.  I was just following the install directions, but if I don't have
to pay attention to the install directions, I'll find a much better one.

Any hints?  Previous email discussion maybe?  I found some references via
searching the archives, but I'm not 100% convinced they are applicable to my
situation.

John



-----Original Message-----
From: news [mailto:news@main.gmane.org]On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 12:48 AM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


John Takacs wrote:
>
> I'd love to try Lucene with the above, but the Lucene install fails
> because of JavaCC issues.  Surprised more people haven't encountered this
> problem, as the install instructions are out of date.

Well, what do you need JavaCC for? Isn't it just the technology for
building the supplied HTML-Parser? There are much better HTML parsers
out there, which you can use.

Ulrich





Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
John Takacs wrote:
 >
> I'd love to try Lucene with the above, but the Lucene install fails because
> of JavaCC issues.  Surprised more people haven't encountered this problem,
> as the install instructions are out of date.

Well, what do you need JavaCC for? Isn't it just the technology for 
building the supplied HTML-Parser? There are much better HTML parsers 
out there, which you can use.

Ulrich





Re: commercial websites powered by Lucene?

Posted by Leo Galambos <Le...@seznam.cz>.
>
>
>BUT, looking at the full-text indexing/searching part... it's not up to snuff.
>
>Currently, I'm using mysql's full text search support. I have a database of
>3-5 million rows. Each row is unique, let's say a product. Each row has
>several columns, but the two I search on are title and description. I
>created a full text index on title and description. Title has approximately
>100 characters, and description has 255 characters.
>
Store the two columns in an extra table; it would help you.

>At the moment, mysql is taking 50 seconds plus to return results on simple
>one-word searches. My dedicated server is a P4 2.0 GHz, 1.5 GB RAM
>Red Hat Linux 7.3 platform, with nothing else running on it, i.e. another
>server is handling HTTP requests. It is a dedicated mysql box.  In addition,
>I'm the only person making queries.
>  
>
Did you write to the MySQL team about it?

-g-





RE: commercial websites powered by Lucene?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> I'd love to try Lucene with the above, but the Lucene install fails
> because of JavaCC issues.  Surprised more people haven't encountered
> this problem, as the install instructions are out of date.

The JavaCC fix is in the queue.  Check Bugzilla for details (link on
Lucene home page).

Otis


> -----Original Message-----
> From: Tatu Saloranta [mailto:tatu@hypermall.net]
> Sent: Wednesday, June 25, 2003 12:26 PM
> To: Lucene Users List
> Subject: Re: commercial websites powered by Lucene?
> 
> 
> On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> > Chris Miller wrote:
> ...
> > Well, nothing against Lucene, but it doesn't solve your problem, which
> > is an overloaded DB server. It may temporarily alleviate the effects,
> > but you'll soon be at the same load again. So I'd recommend to install
>
> I don't think that would necessarily be the case. Like you mention later
> on, indexing data stored in a DB does flatten it to allow faster indexing
> (and retrieval), and faster in this context means more efficient: not
> only sharing the load between DB and search engine, but potentially
> lowering total load.
>
> The alternative, data-warehouse-like preprocessing of data for faster
> search, would likely be doable too, but it's usually more useful for
> running reports. For actual searches Lucene does its job nicely and
> efficiently; the biggest problems I've seen are more related to
> relevancy questions. But that's where tuning of Lucene ranking should be
> easier than trying to build your own ranking from raw database hits
> (except if one uses Oracle Text or such, which is pretty much a search
> engine on top of the DB itself).
>
> So, to me it all comes down to the "right tool for the job" aspect: DBs
> are good at mass retrieval of data, or using aggregate functions (on the
> read-only side), whereas dedicated search engines are better for, well,
> searching.
>
> ...
> > Of course, in real life there may be political obstacles which will
> > prevent you from doing the right thing as detailed above, for example,
> > and your only chance is to circumvent it in some way - and then Lucene
> > is a great way to do that. But keep in mind that you are basically
> > reinventing the functionality that is already built into a database :)
>
> It depends on the type of queries, but Lucene certainly has much more
> advanced text-searching functionality, even if the indexed content comes
> from a rigid structure like an RDBMS. I'm not sure using a ready product
> like Lucene is reinventing much functionality, even considering
> synchronization issues.
>
> So I would go as far as saying that for searching purposes, plain
> vanilla RDBMSs are not all that great in the first place. Even if
> queries need not use advanced search features (advanced as in not just
> using % and _ in addition to exact matches), Lucene may well offer
> better search performance and functionality.
> 
> -+ Tatu +-
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> 
> 





RE: commercial websites powered by Lucene?

Posted by John Takacs <ta...@highway61.com>.
Tatu,

I agree 100% with everything you've said.

Let's look at MySQL for example.  Great database.  No doubt about it.

BUT, looking at the full-text indexing/searching part... it's not up to snuff.

Currently, I'm using MySQL's full-text search support. I have a database of
3-5 million rows. Each row is unique, let's say a product. Each row has
several columns, but the two I search on are title and description. I
created a full text index on title and description. Title has approximately
100 characters, and description has 255 characters.

At the moment, MySQL is taking 50-plus seconds to return results on simple
one-word searches. My dedicated server is a P4, 2.0 GHz, 1.5 GB RAM
Red Hat Linux 7.3 platform, with nothing else running on it, i.e. another
server is handling HTTP requests. It is a dedicated MySQL box.  In addition,
I'm the only person making queries.

Obviously, the above performance is unacceptable for real-world web
applications.

I'd love to try Lucene with the above, but the Lucene install fails because
of JavaCC issues.  I'm surprised more people haven't encountered this problem,
as the install instructions are out of date.

Regards,

John



-----Original Message-----
From: Tatu Saloranta [mailto:tatu@hypermall.net]
Sent: Wednesday, June 25, 2003 12:26 PM
To: Lucene Users List
Subject: Re: commercial websites powered by Lucene?


On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> Chris Miller wrote:
...
> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend installing

I don't think that would necessarily be the case. Like you mention later
on, indexing data stored in a DB does flatten it to allow faster indexing
(and retrieval), and faster in this context means more efficient, not
only sharing the load between DB and search engine, but potentially
lowering total load?

The alternative, data-warehouse-like preprocessing of data for faster
search, would likely be doable too, but it's usually more useful for
running reports. For actual searches Lucene does its job nicely and
efficiently; the biggest problems I've seen are more related to relevancy
questions. But that's where tuning of Lucene ranking should be easier
than trying to build your own ranking from raw database hits (except if
one uses OracleText or the like, which is pretty much a search engine on
top of the DB itself).

So, to me it all comes down to the "right tool for the job" aspect; DBs
are good at mass retrieval of data, or using aggregate functions (on the
read-only side), whereas dedicated search engines are better for, well,
searching.

...
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)

It depends on the type of queries, but Lucene certainly has much more
advanced text searching functionality, even if indexed content comes from
a rigid structure like an RDBMS. I'm not sure using a ready-made product
like Lucene is reinventing much functionality, even considering
synchronization issues?

So I would go as far as saying that for searching purposes, plain
vanilla RDBMSs are not all that great in the first place. Even if
queries need not use advanced search features (advanced as in not just
using % and _ in addition to exact matches), Lucene may well offer
better search performance and functionality.

-+ Tatu +-







Re: commercial websites powered by Lucene?

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> Chris Miller wrote:
...
> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend installing

I don't think that would necessarily be the case. Like you mention later on, 
indexing data stored in a DB does flatten it to allow faster indexing (and 
retrieval), and faster in this context means more efficient, not only sharing 
the load between DB and search engine, but potentially lowering total load?

The alternative, data-warehouse-like preprocessing of data for faster 
search, would likely be doable too, but it's usually more useful for running 
reports. For actual searches Lucene does its job nicely and efficiently; 
the biggest problems I've seen are more related to relevancy questions. But 
that's where tuning of Lucene ranking should be easier than trying to build 
your own ranking from raw database hits (except if one uses OracleText or 
the like, which is pretty much a search engine on top of the DB itself).

So, to me it all comes down to the "right tool for the job" aspect; DBs are good 
at mass retrieval of data, or using aggregate functions (on the read-only side), 
whereas dedicated search engines are better for, well, searching.

...
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)

It depends on the type of queries, but Lucene certainly has much more advanced 
text searching functionality, even if indexed content comes from a rigid 
structure like an RDBMS. I'm not sure using a ready-made product like Lucene is 
reinventing much functionality, even considering synchronization issues?

So I would go as far as saying that for searching purposes, plain vanilla RDBMSs 
are not all that great in the first place. Even if queries need not use 
advanced search features (advanced as in not just using % and _ in addition 
to exact matches) Lucene may well offer better search performance and 
functionality.
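[Editorial note: for instance, a single Lucene query string can mix boolean operators, a sloppy phrase and a prefix wildcard, none of which map onto a plain SQL LIKE. A hedged sketch against the Lucene 1.x QueryParser; the field names here are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySketch {
    public static Query parseExample() throws ParseException {
        // "body" is a hypothetical default field. Syntax per Lucene 1.x:
        // boolean operators, a phrase with slop (~2), and a prefix wildcard.
        return QueryParser.parse(
                "title:(java AND lucene) OR \"full text search\"~2 OR index*",
                "body", new StandardAnalyzer());
    }
}
```

This is a sketch, not a definitive implementation; it needs the Lucene jar on the classpath to compile.]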

-+ Tatu +-




RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
On a more realistic note, we were running our search off Intermedia,
which is an Oracle proprietary search engine. Our Oracle setup was
running off our monster DB server: 8 Sun UltraSPARC III CPUs, 450 MHz,
with 8 GB of RAM. We were looking at alternative search engine
solutions, and we tried about six before I read about Lucene. Had
everything been going smoothly we wouldn't have needed anything, but the
more our site grew, the slower the searches were getting, and I was
seeing loads of 3 and 4 on the CPUs (on an 8-CPU machine). So we decided
to go with an external search engine. When Lucene finally came on-line,
CPU loads came down to 0.5, the DB was serving things faster, and Lucene
hasn't given me trouble since.

You say "However, once the DB server actually has capacity again, flocks
of people will demand (rightly so) that their wishes be realized now".
If you're talking about using Lucene to query statistical information
about data you have in the DB, then you don't need a search engine, you
need a data miner. Lucene will pull data based on dates, keywords and
ranges, but if you want to do sums and GROUP BYs then you shouldn't be
looking at Lucene. And as for joins, the way you structure the data
that Lucene digests is up to you. I pass Lucene XML files that collect
information from 12 tables (a 12-table join); you have to think in the
context of giving Lucene your business objects, not just raw data from
tables. You can run Lucene off a smaller computer and totally isolate
the load from your DB server.
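[Editorial note: to illustrate the "business objects, not raw data" point, here is a hedged sketch against Lucene's 1.x Document/Field API. The column and field names are hypothetical and the 12-table join itself is elided:

```java
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CvIndexer {
    /** Flatten one pre-joined "business object" row into a single flat
     *  Lucene Document. Column and field names here are made up. */
    public static void indexRow(IndexWriter writer, ResultSet rs)
            throws SQLException, IOException {
        Document doc = new Document();
        doc.add(Field.Keyword("id", rs.getString("cv_id")));   // stored, not analyzed
        doc.add(Field.Text("title", rs.getString("title")));   // stored and analyzed
        doc.add(Field.UnStored("body", rs.getString("flattened_text"))); // indexed only
        writer.addDocument(doc);
    }
}
```

A sketch only; the join is done once at indexing time, so the searcher never pays for it.]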


Nader Henein




Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
Chris Miller wrote:
> 
> I'm not clear on why you think we'll soon be back up to the same load on the
> DB server?

Experience ;-)

If my DB server is overloaded and everyone knows that, people will not 
come to me with less-than-important ideas for additional searches and 
stuff. And if they do come, I can turn them down. However, once the DB 
server actually has capacity again, flocks of people will demand 
(rightly so) that their wishes be realized now: "Cool, now you can run 
my report every minute instead of every hour."

> What is going to increase the load? Our volume of data is not
> increasing, all that will change is that the DB will no longer get hit for
> searches. We'll still be pulling content etc from the database at roughly
> the same rate, but that doesn't appear to be a source of any problems.
> Whether we offload the searching to MySQL DBs or Lucene makes no difference
> as far as I can see.

Not in as far as load increase is concerned. But there are some 
differences you'll have to consider:

SQL vs. Lucene Query Language
Users/Groups/Permissions vs. nothing
Transactions vs. nothing
Vanilla indexing vs. powerful, flexible, customizable indexing
All kinds of APIs vs. Java API
need to write a replicator vs. need to write an indexer

cheers,

Ulrich



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
Thanks for the pointers, and rest assured we are looking into such
approaches. However the data that we have is coming in from a wide variety
of customers and is unfortunately not nearly as structured as we would like
(and we are powerless to change that). So while we do have some fields that
are database-friendly, a large portion of what we have to search against is
plain text, which is why I'm looking into Lucene as a possible solution. To
be honest I'm still a bit torn between using a database or Lucene for
searching since the data we have falls into the grey area between the two.
Once we have decent proofs of concept of each approach up and running, I
guess a clearer picture will emerge.

I'm not clear on why you think we'll soon be back up to the same load on the
DB server? What is going to increase the load? Our volume of data is not
increasing, all that will change is that the DB will no longer get hit for
searches. We'll still be pulling content etc from the database at roughly
the same rate, but that doesn't appear to be a source of any problems.
Whether we offload the searching to MySQL DBs or Lucene makes no difference
as far as I can see.

> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend installing
> additional databases (MySQL comes to mind), which contain duplicates of
> your data, but in a form that is customized to your searches. Then do
> the searches on these databases and use the SQL Server merely as a
> storage backend and definitive data source.
>
> What makes searches complex in databases are usually joins. It is
> therefore a good idea to join only once (i.e. at data creation time) and
> then copy the aggregated data in a flat form into a search database.
> That is basically what you are doing with Lucene right now, but Lucene
> is a full-text indexer, it is geared towards unstructured data. If your
> data is already in a database in a structured form, it doesn't make much
> sense IMHO to use Lucene.
>
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)
>
> Ulrich






Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
Chris Miller wrote:
> 
> Fair enough, I haven't tried much in the way of profiling yet. I just
> thought you might have found some Lucene settings that made a big difference
> for you, or you'd found indexing into a RAMDirectory then dumping it to disk
> was faster, etc. But it sounds like you're pretty happy with near default
> settings.

Yes, definitely.

> Our current DB server (running SQL Server) is under enormous strain, partly
> due to the complex searches that are being performed against it. We've got
> it pretty heavily tweaked already, so I don't think there's too much room to
> improve on that front. The idea is to use Lucene to take the searching load
> off it so it can get on with all the other tasks it has to perform. The
> Lucene implementation I'm working on here is just a proof of concept - it
> may be that we stay with SQL Server in the long run anyway, but Lucene
> definitely seems to be worth investigating - it has certainly worked well
> for us on smaller projects.

Well, nothing against Lucene, but it doesn't solve your problem, which 
is an overloaded DB-Server. It may temporarily alleviate the effects, 
but you'll soon be at the same load again. So I'd recommend installing 
additional databases (MySQL comes to mind), which contain duplicates of 
your data, but in a form that is customized to your searches. Then do 
the searches on these databases and use the SQL Server merely as a 
storage backend and definitive data source.

What makes searches complex in databases are usually joins. It is 
therefore a good idea to join only once (i.e. at data creation time) and 
then copy the aggregated data in a flat form into a search database. 
That is basically what you are doing with Lucene right now, but Lucene 
is a full-text indexer, it is geared towards unstructured data. If your 
data is already in a database in a structured form, it doesn't make much 
sense IMHO to use Lucene.

Of course, in real life there may be political obstacles which will 
prevent you from doing the right thing as detailed above for example, 
and your only chance is to circumvent in some way - and then Lucene is a 
great way to do that. But keep in mind that you are basically 
reinventing the functionality that is already built-in in a database :)

Ulrich





Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
> This is a good approach if the number of total documents doesn't grow
> too much. There's obviously a limit to full index runs at some point.

Well, I was actually going to go with incremental indexing, since a full
reindex will probably take ~1 hour. We have a relatively fixed size of data,
but the data is updated very frequently - almost 100% turnover/day.

> You need to find out where you lose most of the time:

Fair enough, I haven't tried much in the way of profiling yet. I just
thought you might have found some Lucene settings that made a big difference
for you, or you'd found indexing into a RAMDirectory then dumping it to disk
was faster, etc. But it sounds like you're pretty happy with near default
settings.

> However, what I wonder: if you have your data in a database anyway, why
> not use the database's indexing features? It seems like Lucene is an
> additional layer on top of your data, which you don't really need.

Our current DB server (running SQL Server) is under enormous strain, partly
due to the complex searches that are being performed against it. We've got
it pretty heavily tweaked already, so I don't think there's too much room to
improve on that front. The idea is to use Lucene to take the searching load
off it so it can get on with all the other tasks it has to perform. The
Lucene implementation I'm working on here is just a proof of concept - it
may be that we stay with SQL Server in the long run anyway, but Lucene
definitely seems to be worth investigating - it has certainly worked well
for us on smaller projects.






RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
We were using Oracle Intermedia before we switched to Lucene, and Lucene
has been much faster, and it has allowed us to distribute our search
functionality over multiple servers. Intermedia, which is supposedly one
of the best in the business, couldn't hold a candle to Lucene, and our
Oracle installation and setup is impeccable; we spent years perfecting
it before we decided to separate from Intermedia and use Oracle as a
DBMS, not a search engine. Also, when you use Lucene and not a
proprietary product like Intermedia, you can switch databases at will if
licensing fees become too high to ignore.

Nader

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 3:40 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Chris Miller wrote:
> Thanks for your comments Ulrich. I just posted a message asking if 
> anyone had attempted this approach! Sounds like you have, and it works
> :-)  Thanks for the information, this sounds pretty close to what my 
> preferred approach would be.

This is a good approach if the number of total documents doesn't grow 
too much. There's obviously a limit to full index runs at some point.

> You say you get 2000 docs/minute. I've done some benchmarking and 
> managed to get our data indexing at ~1000/minute on an Athlon 1800+ 
> (and most of that speed was achieved by bumping the 
> IndexWriter.mergeFactor up to 100 or so). Our data is coming from a 
> database table, each record contains about 40 fields, and I'm indexing
> 8 of those fields (an ID, 4 number fields, 3 text fields including one
> that has ~2k text). Does this sound reasonable to you, or do you have 
> any tips that might improve that performance?

You need to find out where you lose most of the time:

a) in data access (like your database could be too slow, in my case I am 
scanning the local filesystem)
b) in parsing (probably not an issue when reading from a DB, but in my 
case it is, I have HTML files)
c) in indexing

I haven't gone to the trouble to find that out for my app, because it is 
fast enough the way it is.

However, what I wonder: if you have your data in a database anyway, why 
not use the database's indexing features? It seems like Lucene is an 
additional layer on top of your data, which you don't really need.

cheers,

Ulrich









Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
Chris Miller wrote:
> Thanks for your comments Ulrich. I just posted a message asking if anyone
> had attempted this approach! Sounds like you have, and it works :-)  Thanks
> for the information, this sounds pretty close to what my preferred approach
> would be.

This is a good approach if the number of total documents doesn't grow 
too much. There's obviously a limit to full index runs at some point.

> You say you get 2000 docs/minute. I've done some benchmarking and managed to
> get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that
> speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so).
> Our data is coming from a database table, each record contains about 40
> fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text
> fields including one that has ~2k text). Does this sound reasonable to you,
> or do you have any tips that might improve that performance?

You need to find out where you lose most of the time:

a) in data access (like your database could be too slow, in my case I am 
scanning the local filesystem)
b) in parsing (probably not an issue when reading from a DB, but in my 
case it is, I have HTML files)
c) in indexing

I haven't gone to the trouble to find that out for my app, because it is 
fast enough the way it is.

However, what I wonder: if you have your data in a database anyway, why 
not use the database's indexing features? It seems like Lucene is an 
additional layer on top of your data, which you don't really need.

cheers,

Ulrich





Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
Thanks for your comments Ulrich. I just posted a message asking if anyone
had attempted this approach! Sounds like you have, and it works :-)  Thanks
for the information, this sounds pretty close to what my preferred approach
would be.

You say you get 2000 docs/minute. I've done some benchmarking and managed to
get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that
speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so).
Our data is coming from a database table, each record contains about 40
fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text
fields including one that has ~2k text). Does this sound reasonable to you,
or do you have any tips that might improve that performance?
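[Editorial note: for reference, the bulk-indexing knob mentioned above is a plain public field on IndexWriter in Lucene 1.x. A hedged sketch, with an illustrative path and value; a higher mergeFactor buffers more segments before merging, trading open file handles and a bigger final optimize for faster bulk adds:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexSketch {
    public static void main(String[] args) throws IOException {
        // true = create a fresh index at this (made-up) path
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.mergeFactor = 100; // merge segments less often during the bulk load
        // ... addDocument() loop over the database rows goes here ...
        writer.optimize();        // collapse segments once, before searching
        writer.close();
    }
}
```

A sketch only; it needs the Lucene jar to compile, and the right mergeFactor depends on the OS file-handle limit.]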


"Ulrich Mayring" <ul...@denic.de> wrote in message
news:bd9a15$vk4$1@main.gmane.org...
> Chris Miller wrote:
> >
> > The main thing I'm interested in is how you handle updates to Lucene's
> > index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index
> > updates must place a reasonable load on the CPU/disk. Do you keep CVs
and
> > jobs in the same index or two different ones? And what is the process
you
> > use to update the index(es) - do you batch-process updates or do you
handle
> > them in real-time as changes are made?
>
> The way we do it: we re-index everything periodically in a temporary
> directory and then rename the temporary directory. That way the index
> remains accessible at all times and its currency is simply determined by
> the interval I run the re-indexing in.
>
> >  We need to be able to handle indexing about 60,000 documents/day,
> > while allowing (many) searches to continue operating alongside.
>
> On an entry-level Sun I can index about 23 documents per second and
> these are real-life HTML pages. Thus in less than one hour you would be
> finished with a complete index run and save yourself all kinds of
> trouble with crashes during indexing etc.
>
> On my 2 GHz Linux workstation it's even faster: more than 2000 documents
> per minute, so you'd be done in half an hour.
>
> BTW, we're not using the supplied JavaCC-based HTML parser, instead we
> got htmlparser.sourceforge.net, which is a joy to use and pretty fast.
>
> Ulrich






RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
Take care that the indexing speed is also dependent on what type of
files you're working with; indexing simple HTML files will be faster
than indexing a 30-field XML file with date fields, let's say. With us a
full re-index takes about 4 hours on 500k documents, so doing
periodic full re-indexing would have too high a cost on the system.


-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 2:42 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Chris Miller wrote:
> 
> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so
> index updates must place a reasonable load on the CPU/disk. Do you 
> keep CVs and jobs in the same index or two different ones? And what is
> the process you use to update the index(es) - do you batch-process 
> updates or do you handle them in real-time as changes are made?

The way we do it: we re-index everything periodically in a temporary 
directory and then rename the temporary directory. That way the index 
remains accessible at all times and its currency is simply determined by 
the interval I run the re-indexing in.

>  We need to be able to handle indexing about 60,000 documents/day, 
> while allowing (many) searches to continue operating alongside.

On an entry-level Sun I can index about 23 documents per second and 
these are real-life HTML pages. Thus in less than one hour you would be 
finished with a complete index run and save yourself all kinds of 
trouble with crashes during indexing etc.

On my 2 GHz Linux workstation it's even faster: more than 2000 documents 
per minute, so you'd be done in half an hour.

BTW, we're not using the supplied JavaCC-based HTML parser, instead we 
got htmlparser.sourceforge.net, which is a joy to use and pretty fast.

Ulrich









Re: commercial websites powered by Lucene?

Posted by Ulrich Mayring <ul...@denic.de>.
Chris Miller wrote:
> 
> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
> updates must place a reasonable load on the CPU/disk. Do you keep CVs and
> jobs in the same index or two different ones? And what is the process you
> use to update the index(es) - do you batch-process updates or do you handle
> them in real-time as changes are made?

The way we do it: we re-index everything periodically in a temporary 
directory and then rename the temporary directory. That way the index 
remains accessible at all times and its currency is simply determined by 
the interval I run the re-indexing in.
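[Editorial note: the swap itself is plain filesystem work, so a minimal sketch needs only java.io. Directory names are made up, and the Lucene indexing into the temporary directory is elided:

```java
import java.io.File;

public class IndexSwapper {
    /** Build the new index in tmpDir, then swap it into place:
     *  live -> live.old, tmp -> live. Searchers opened against the old
     *  directory keep working; new searchers see the fresh index. */
    public static boolean swapIntoPlace(File liveDir, File tmpDir) {
        // ... re-index everything into tmpDir here (e.g. with IndexWriter) ...
        File old = new File(liveDir.getParentFile(), liveDir.getName() + ".old");
        deleteRecursively(old); // discard the previous generation, if any
        if (liveDir.exists() && !liveDir.renameTo(old)) {
            return false;       // live index busy (e.g. held open on Windows); retry later
        }
        return tmpDir.renameTo(liveDir);
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles(); // null for plain files or missing paths
        for (int i = 0; children != null && i < children.length; i++) {
            deleteRecursively(children[i]);
        }
        f.delete();
    }
}
```

A sketch under the stated assumptions; renameTo is cheap on the same filesystem but is not guaranteed atomic across mounts.]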

>  We need to be able to handle indexing about 60,000 documents/day,
> while allowing (many) searches to continue operating alongside.

On an entry-level Sun I can index about 23 documents per second and 
these are real-life HTML pages. Thus in less than one hour you would be 
finished with a complete index run and save yourself all kinds of 
trouble with crashes during indexing etc.

On my 2 GHz Linux workstation it's even faster: more than 2000 documents 
per minute, so you'd be done in half an hour.

BTW, we're not using the supplied JavaCC-based HTML parser, instead we 
got htmlparser.sourceforge.net, which is a joy to use and pretty fast.

Ulrich





RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
Because I've set up Lucene as a webapp with a centralized Init file and
a setup properties file, I do my sanity check in the Init, because if the
server crashes mid-indexing, I have to delete the lock files, optimize,
and re-index the files that were being indexed when the crash occurred.
There was a long discussion about this back in August; search for "Crash
/ Recovery Scenario" in the lucene-dev archived discussions. It should
answer all your questions.
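[Editorial note: a minimal init-time sanity check along those lines, using only java.io. It assumes Lucene 1.x's convention of lock files with a `.lock` suffix (e.g. write.lock, commit.lock) inside the index directory, and it must only run at startup, when no other process has the index open:

```java
import java.io.File;
import java.io.FilenameFilter;

public class LockSweeper {
    /** Remove stale Lucene lock files left behind by a crash so the
     *  index can be opened again. Returns the number of files removed. */
    public static int clearStaleLocks(File indexDir) {
        File[] locks = indexDir.listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".lock");
            }
        });
        int removed = 0;
        for (int i = 0; locks != null && i < locks.length; i++) {
            if (locks[i].delete()) {
                removed++;
            }
        }
        return removed;
    }
}
```

A sketch only; exact lock-file names and locations vary between Lucene versions, so check your version before relying on the `.lock` suffix.]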

Nader Henein

-----Original Message-----
From: Gareth Griffiths [mailto:Gareth.Griffiths@bridgeheadsoftware.com] 
Sent: Tuesday, June 24, 2003 1:11 PM
To: Lucene Users List; nsh@bayt.net
Subject: Re: commercial websites powered by Lucene?


Nader,
You say you have to cope with a server crash mid-indexing. I think I'm
seeing lots of garbage files created by a server crash mid merge/optimise
while Lucene is creating a new index. Did you write code specifically to
handle this, or is there something more automated? (I was thinking of
writing a sanity check for before start-up that looked in 'segments' and
'deletable' and got rid of any files in the catalog directory that are
not referenced.)

Did you do something similar or have I missed something...

TIA

Gareth


----- Original Message -----
From: "Nader S. Henein" <ns...@bayt.net>
To: "'Lucene Users List'" <lu...@jakarta.apache.org>
Sent: Tuesday, June 24, 2003 9:30 AM
Subject: RE: commercial websites powered by Lucene?


> I handle updates or inserts the same way: first I delete the document
> from the index and then I insert it (better safe than sorry). I batch
> my updates/inserts every twenty minutes; I would do it in smaller
> intervals, but since I have to sync the XML files created from the DB
> to three machines (I maintain three separate Lucene indices on my
> three separate
> web-servers) it takes a little longer. You have to batch your changes
> because updating the index takes time, as opposed to deletes, which I
> batch every two minutes. You won't have a problem updating the index
> and
> searching at the same time because Lucene updates the index on a
> separate set of files and then, when it's done, it overwrites the old
> version. I've had to provide for backups, and things like server
> crashes
> mid-indexing, but I was using Oracle Intermedia before and Lucene
> BLOWS
> IT AWAY.
>
> -----Original Message-----
> From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
> Sent: Tuesday, June 24, 2003 12:06 PM
> To: lucene-user@jakarta.apache.org
> Subject: Re: commercial websites powered by Lucene?
>
>
> Hi Nader,
>
> I was wondering if you'd mind me asking you a couple of questions 
> about your implementation?
>
> The main thing I'm interested in is how you handle updates to Lucene's

> index. I'd imagine you have a fairly high turnover of CVs and jobs, so

> index updates must place a reasonable load on the CPU/disk. Do you 
> keep CVs and jobs in the same index or two different ones? And what is

> the process you use to update the index(es) - do you batch-process 
> updates or do you handle them in real-time as changes are made?
>
> Any insight you can offer would be much appreciated as I'm about to 
> implement something similar and am a little unsure of the best 
> approach to take. We need to be able to handle indexing about 60,000 
> documents/day, while allowing (many) searches to continue operating 
> alongside.
>
> Thanks!
> Chris
>
> "Nader S. Henein" <ns...@bayt.net> wrote in message 
> news:001401c32b38$32aa2440$d501a8c0@naderit...
> > We use Lucene http://www.bayt.com , we're basically an on-line 
> > Recruitment site and up until now we've got around 500 000 CVs and 
> > documents indexed with results that stump Oracle Intermedia.
> >
> > Nader Henein
> > Senior Web Dev
> >
> > Bayt.com
> >
> > -----Original Message-----
> > From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> > Sent: Wednesday, June 04, 2003 6:09 PM
> > To: lucene-user@jakarta.apache.org
> > Subject: commercial websites powered by Lucene?
> >
> >
> >
> > Hello All,
> >
> > I've been trying to find examples of large commercial websites that 
> > use Lucene to power their search.  Having such examples would make 
> > Lucene an easy sell to management
> >
> > Does anyone know of any good examples?  The bigger the better, and 
> > the
>
> > more the better.
> >
> > TIA,
> > -John
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org






Re: commercial websites powered by Lucene?

Posted by Gareth Griffiths <Ga...@bridgeheadsoftware.com>.
Nader,
You say you have to cope with server crashes mid-indexing. I think I'm seeing
lots of garbage files created by a server crash mid merge/optimise while
Lucene is creating a new index. Did you write code specifically to handle
this, or is there something more automated? (I was thinking of writing a
sanity check before start-up that looked in 'segments' and 'deletable'
and got rid of any files in the catalog directory that are not referenced.)

Did you do something similar, or have I missed something?

TIA

Gareth


----- Original Message -----
From: "Nader S. Henein" <ns...@bayt.net>
To: "'Lucene Users List'" <lu...@jakarta.apache.org>
Sent: Tuesday, June 24, 2003 9:30 AM
Subject: RE: commercial websites powered by Lucene?


> I handle updates or inserts the same way first I delete the document
> from the index and then I insert it (better safe than sorry), I batch my
> updates/inserts every twenty minutes, I would do it in smaller intervals
> but since I have to sync the XML files created from the DB to three
> machines (I maintain three separate Lucene indices on my three separate
> web-servers) it takes a little longer. You have to batch your changes
> because Updating the index takes time as opposed to deleted which I
> batch every two minutes. You won't have a problem updating the index and
> searching at the same time because lucene updates the index on a
> separate set of files and then when It's done it overwrites the old
> version. I've had to provide for Backups, and things like server crashes
> mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
> IT AWAY.
>
> -----Original Message-----
> From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
> Sent: Tuesday, June 24, 2003 12:06 PM
> To: lucene-user@jakarta.apache.org
> Subject: Re: commercial websites powered by Lucene?
>
>
> Hi Nader,
>
> I was wondering if you'd mind me asking you a couple of questions about
> your implementation?
>
> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so
> index updates must place a reasonable load on the CPU/disk. Do you keep
> CVs and jobs in the same index or two different ones? And what is the
> process you use to update the index(es) - do you batch-process updates
> or do you handle them in real-time as changes are made?
>
> Any insight you can offer would be much appreciated as I'm about to
> implement something similar and am a little unsure of the best approach
> to take. We need to be able to handle indexing about 60,000
> documents/day, while allowing (many) searches to continue operating
> alongside.
>
> Thanks!
> Chris
>
> "Nader S. Henein" <ns...@bayt.net> wrote in message
> news:001401c32b38$32aa2440$d501a8c0@naderit...
> > We use Lucene http://www.bayt.com , we're basically an on-line
> > Recruitment site and up until now we've got around 500 000 CVs and
> > documents indexed with results that stump Oracle Intermedia.
> >
> > Nader Henein
> > Senior Web Dev
> >
> > Bayt.com
> >
> > -----Original Message-----
> > From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> > Sent: Wednesday, June 04, 2003 6:09 PM
> > To: lucene-user@jakarta.apache.org
> > Subject: commercial websites powered by Lucene?
> >
> >
> >
> > Hello All,
> >
> > I've been trying to find examples of large commercial websites that
> > use Lucene to power their search.  Having such examples would make
> > Lucene an easy sell to management
> >
> > Does anyone know of any good examples?  The bigger the better, and the
>
> > more the better.
> >
> > TIA,
> > -John
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
The search is a little sluggish because our initial architecture was
based on TCL, not Java. Until we complete the full Java overhaul, every
time I perform a search the AOL webserver (TCL) has to call the servlet
in Resin (where Lucene is) and then perform the search. Then, and this is
the killer, I have to parse all the results from a Java Collection into a
TCL list. The most intense search, with thousands of results, takes less
than a second; it's all the things I have to do around it that take time.

Nader

-----Original Message-----
From: John Takacs [mailto:takacsj@highway61.com] 
Sent: Tuesday, June 24, 2003 1:52 PM
To: Lucene Users List
Subject: RE: commercial websites powered by Lucene?


Hi Nader,

This thread is by far one of the best, and most practical.  It will only
be topped when someone provides benchmarks for a DMOZ.org type directory
of 3 million plus urls.  I would love to, but the whole JavaCC thing is
a show stopper.

Questions:

I noticed that search is a little slow.  What has been your experience?
Perhaps it was a bandwidth issue, but I'm living in a country with the
greatest internet connectivity and penetration in the world (South
Korea), so I don't think that is an issue on my end.

You have 500,000 resumes.  Based on the steps you took to get to
500,000, do you think your current setup will scale to millions, like
say, 3 million or so?

What is your hardware like?  CPU/RAM?

Warm regards, and thanks for sharing.  If I can ever get past the
Lucene/JavaCC installation failure, I'll share my benchmarks on the
above directory scenario.

John



-----Original Message-----
From: Nader S. Henein [mailto:nsh@bayt.net]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?


 I handle updates or inserts the same way first I delete the document
from the index and then I insert it (better safe than sorry), I batch my
updates/inserts every twenty minutes, I would do it in smaller intervals
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because Updating the index takes time as opposed to deleted which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time because lucene updates the index on a
separate set of files and then when It's done it overwrites the old
version. I've had to provide for Backups, and things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:001401c32b38$32aa2440$d501a8c0@naderit...
> We use Lucene http://www.bayt.com , we're basically an on-line 
> Recruitment site and up until now we've got around 500 000 CVs and 
> documents indexed with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -----Original Message-----
> From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: lucene-user@jakarta.apache.org
> Subject: commercial websites powered by Lucene?
>
>
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that 
> use Lucene to power their search.  Having such examples would make 
> Lucene an easy sell to management
>
> Does anyone know of any good examples?  The bigger the better, and the

> more the better.
>
> TIA,
> -John
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org






RE: commercial websites powered by Lucene?

Posted by John Takacs <ta...@highway61.com>.
Hi Nader,

This thread is by far one of the best, and most practical.  It will only be
topped when someone provides benchmarks for a DMOZ.org type directory of 3
million plus urls.  I would love to, but the whole JavaCC thing is a show
stopper.

Questions:

I noticed that search is a little slow.  What has been your experience?
Perhaps it was a bandwidth issue, but I'm living in a country with the
greatest internet connectivity and penetration in the world (South Korea),
so I don't think that is an issue on my end.

You have 500,000 resumes.  Based on the steps you took to get to 500,000, do
you think your current setup will scale to millions, like say, 3 million or
so?

What is your hardware like?  CPU/RAM?

Warm regards, and thanks for sharing.  If I can ever get past the
Lucene/JavaCC installation failure, I'll share my benchmarks on the above
directory scenario.

John



-----Original Message-----
From: Nader S. Henein [mailto:nsh@bayt.net]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?


 I handle updates or inserts the same way first I delete the document
from the index and then I insert it (better safe than sorry), I batch my
updates/inserts every twenty minutes, I would do it in smaller intervals
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because Updating the index takes time as opposed to deleted which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time because lucene updates the index on a
separate set of files and then when It's done it overwrites the old
version. I've had to provide for Backups, and things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:001401c32b38$32aa2440$d501a8c0@naderit...
> We use Lucene http://www.bayt.com , we're basically an on-line
> Recruitment site and up until now we've got around 500 000 CVs and
> documents indexed with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -----Original Message-----
> From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: lucene-user@jakarta.apache.org
> Subject: commercial websites powered by Lucene?
>
>
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would make
> Lucene an easy sell to management
>
> Does anyone know of any good examples?  The bigger the better, and the

> more the better.
>
> TIA,
> -John
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org






RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
 I handle updates and inserts the same way: first I delete the document
from the index and then I insert it (better safe than sorry). I batch my
updates/inserts every twenty minutes; I would do it at smaller intervals,
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because updating the index takes time, as opposed to deletes, which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time, because Lucene builds the updated index in a
separate set of files and then, when it's done, overwrites the old
version. I've had to provide for backups and things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.
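
As a sketch of the delete-then-insert pattern described above, the following
uses a HashMap as a stand-in for the index; with the real Lucene API the
delete and insert would go through IndexReader/IndexWriter instead. The
point is that the same update is safe for both new and changed documents.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the "delete first, then insert" update strategy.
 *  Running the same update twice, or updating an existing document,
 *  always leaves exactly one current copy in the index. */
public class BatchUpdater {
    private final Map<String, String> index = new HashMap<>();

    public void update(String id, String content) {
        index.remove(id);        // delete any existing copy (no-op for new docs)
        index.put(id, content);  // insert the fresh version
    }

    public String get(String id) { return index.get(id); }
    public int size() { return index.size(); }
}
```

In the real system the update(id, content) calls would be queued and flushed
in a batch every twenty minutes, as described above.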

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:001401c32b38$32aa2440$d501a8c0@naderit...
> We use Lucene http://www.bayt.com , we're basically an on-line 
> Recruitment site and up until now we've got around 500 000 CVs and 
> documents indexed with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -----Original Message-----
> From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: lucene-user@jakarta.apache.org
> Subject: commercial websites powered by Lucene?
>
>
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that 
> use Lucene to power their search.  Having such examples would make 
> Lucene an easy sell to management
>
> Does anyone know of any good examples?  The bigger the better, and the

> more the better.
>
> TIA,
> -John
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org






Re: commercial websites powered by Lucene?

Posted by Chris Miller <ch...@hotmail.com>.
Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about your
implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
updates must place a reasonable load on the CPU/disk. Do you keep CVs and
jobs in the same index or two different ones? And what is the process you
use to update the index(es) - do you batch-process updates or do you handle
them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach to
take. We need to be able to handle indexing about 60,000 documents/day,
while allowing (many) searches to continue operating alongside.

Thanks!
Chris

"Nader S. Henein" <ns...@bayt.net> wrote in message
news:001401c32b38$32aa2440$d501a8c0@naderit...
> We use Lucene http://www.bayt.com , we're basically an on-line
> Recruitment site and up until now we've got around 500 000 CVs and
> documents indexed with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -----Original Message-----
> From: John_Chun@platts.com [mailto:John_Chun@platts.com]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: lucene-user@jakarta.apache.org
> Subject: commercial websites powered by Lucene?
>
>
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would
> make Lucene an easy sell to management
>
> Does anyone know of any good examples?  The bigger the better, and
> the more the better.
>
> TIA,
> -John
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: commercial websites powered by Lucene?

Posted by "Nader S. Henein" <ns...@bayt.net>.
We use Lucene http://www.bayt.com , we're basically an on-line
Recruitment site and up until now we've got around 500 000 CVs and
documents indexed with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev

Bayt.com

-----Original Message-----
From: John_Chun@platts.com [mailto:John_Chun@platts.com] 
Sent: Wednesday, June 04, 2003 6:09 PM
To: lucene-user@jakarta.apache.org
Subject: commercial websites powered by Lucene?



Hello All,

I've been trying to find examples of large commercial websites that
use Lucene to power their search.  Having such examples would
make Lucene an easy sell to management

Does anyone know of any good examples?  The bigger the better, and
the more the better.

TIA,
-John



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org






RE: java.lang.IllegalArgumentException: attempt to access a deleted document

Posted by Rob Outar <ro...@ideorlando.org>.
I added the following code:

   for (int i = 0; i < numOfDocs; i++) {
                if ( !reader.isDeleted(i)) {
                    doc = reader.document(i);
                    docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
                }
            }
            return docs;

but it never enters the if statement; for every value of i, isDeleted(i)
returns true?!?  Am I doing something wrong?  I was trying to do what Doug
outlined below.


Thanks,

Rob
-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Wednesday, June 04, 2003 12:34 PM
To: Lucene Users List
Subject: Re: java.lang.IllegalArgumentException: attempt to access a
deleted document


Rob Outar wrote:
>  public synchronized String[] getDocuments() throws IOException {
>
>         IndexReader reader = null;
>         try {
>             reader = IndexReader.open(this.indexLocation);
>             int numOfDocs      = reader.numDocs();
>             String[] docs      = new String[numOfDocs];
>             Document doc       = null;
>
>             for (int i = 0; i < numOfDocs; i++) {
>                 doc = reader.document(i);
>                 docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
>             }
>             return docs;
>         }
>         finally {
>             if (reader != null) {
>                 reader.close();
>             }
>         }
>     }

The limit of your iteration should be IndexReader.maxDoc(), not
IndexReader.numDocs():

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReade
r.html#maxDoc()

Also, you should first check that each document is not deleted before
calling IndexReader.document(int):

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReade
r.html#isDeleted(int)

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Re: java.lang.IllegalArgumentException: attempt to access a deleted document

Posted by Doug Cutting <cu...@lucene.com>.
Rob Outar wrote:
>  public synchronized String[] getDocuments() throws IOException {
> 
>         IndexReader reader = null;
>         try {
>             reader = IndexReader.open(this.indexLocation);
>             int numOfDocs      = reader.numDocs();
>             String[] docs      = new String[numOfDocs];
>             Document doc       = null;
> 
>             for (int i = 0; i < numOfDocs; i++) {
>                 doc = reader.document(i);
>                 docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
>             }
>             return docs;
>         }
>         finally {
>             if (reader != null) {
>                 reader.close();
>             }
>         }
>     }

The limit of your iteration should be IndexReader.maxDoc(), not 
IndexReader.numDocs():

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#maxDoc()

Also, you should first check that each document is not deleted before 
calling IndexReader.document(int):

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#isDeleted(int)

Doug
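
Putting the two fixes above together, the corrected loop looks roughly like
this. Here the boolean[] stands in for IndexReader.isDeleted(int) and the
array length for maxDoc(); note that a growable list replaces the fixed
numDocs-sized array, since skipping deleted slots would otherwise leave gaps.

```java
import java.util.ArrayList;
import java.util.List;

/** Corrected iteration pattern for walking an index with deletions:
 *  loop to maxDoc(), skip deleted slots, and collect survivors into a
 *  growable list rather than a fixed numDocs-sized array. */
public class LiveDocCollector {

    // docsByIndex.length plays the role of IndexReader.maxDoc();
    // deleted[i] plays the role of IndexReader.isDeleted(i).
    public static List<String> collect(String[] docsByIndex, boolean[] deleted) {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docsByIndex.length; i++) {
            if (!deleted[i]) {
                live.add(docsByIndex[i]);
            }
        }
        return live;
    }
}
```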


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


java.lang.IllegalArgumentException: attempt to access a deleted document

Posted by Rob Outar <ro...@ideorlando.org>.
Hi all,

	I have written a GUI program that looks for files on a given file system
that are not in the index and, vice versa, documents that are in the index
but not on the file system. It basically finds orphan files. We do not want
files on the file system that are not in the index, or documents in the
index that do not have a corresponding physical file on the file system.

	In the former case I get a list of all the files on the file system and
look in the index for the corresponding Document. In the latter case I get
a list of all documents and then look for the files on the file system.
During this process I get the above exception. Has anyone seen this before?
I am not sure how I am trying to access a deleted document; I think the
problem might be in the code below:

 public synchronized String[] getDocuments() throws IOException {

        IndexReader reader = null;
        try {
            reader = IndexReader.open(this.indexLocation);
            int numOfDocs      = reader.numDocs();
            String[] docs      = new String[numOfDocs];
            Document doc       = null;

            for (int i = 0; i < numOfDocs; i++) {
                doc = reader.document(i);
                docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
            }
            return docs;
        }
        finally {
            if (reader != null) {
                reader.close();
            }
        }
    }

but I am not sure.

Any help would be appreciated.

Thanks,

Rob


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: commercial websites powered by Lucene?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
A few big names are listed in the 1st Lucene article on Onjava.com, if
I recall correctly.

Otis


--- John_Chun@platts.com wrote:
> 
> 
> Hello All,
> 
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would
> make Lucene an easy sell to management
> 
> Does anyone know of any good examples?  The bigger the better, and
> the more the better.
> 
> TIA,
> -John
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
http://calendar.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org