Posted to java-user@lucene.apache.org by Tim Jones <ti...@mongoosetech.com> on 2002/10/29 21:02:38 UTC

Your experiences with Lucene

Hi,
 
I am currently starting work on a project that requires indexing and
searching on potentially thousands, maybe tens of thousands, of text
documents.
 
I'm hoping that someone has a great success story about using Lucene for
a project that required indexing and searching of a large number of
documents.
Like maybe more than 10,000. I guess what I'm trying to figure out is if
Lucene's performance will be acceptable where the number of documents is
very large.
I realize this is a very general question but I just need a general
answer.
 
Thanks,
 
Tim J.

Re: Your experiences with Lucene

Posted by Che Dong <ch...@hotmail.com>.
My experiences:
1. Cache results if the source documents don't update frequently.
2. Cache only the first 100 results. When a read goes past result 100, Lucene
searches again and builds a 200-result buffer; if that is reached, it searches
again and builds a 400-result buffer, and so on (see the sketch below).
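
For illustration, here is a minimal sketch of such a doubling result buffer. It
is written against a later IndexSearcher API (search(Query, int) returning
TopDocs) rather than the 2002-era Hits class, and the class and method names
are made up:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

/** Caches the top hits of a query and re-runs the search with a doubled
 *  window (100, 200, 400, ...) whenever a caller pages past the cache. */
class DoublingResultCache {
    private final IndexSearcher searcher;
    private final Query query;
    private ScoreDoc[] cached;
    private int window = 100;                 // first 100 results only

    DoublingResultCache(IndexSearcher searcher, Query query) throws IOException {
        this.searcher = searcher;
        this.query = query;
        this.cached = searcher.search(query, window).scoreDocs;
    }

    /** Returns hit n, searching again with a larger buffer when needed. */
    ScoreDoc hit(int n) throws IOException {
        while (n >= cached.length) {
            if (cached.length < window) {     // the index has no more hits
                throw new IndexOutOfBoundsException("only " + cached.length + " hits");
            }
            window *= 2;                      // 100 -> 200 -> 400 -> ...
            cached = searcher.search(query, window).scoreDocs;
        }
        return cached[n];
    }
}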

http://search.163.com uses Lucene for category search and news search; it
handles 10 queries/sec on two PIII (1G, Linux) boxes.

Che, Dong
----- Original Message ----- 
From: "Tim Jones" <ti...@mongoosetech.com>
To: <lu...@jakarta.apache.org>
Sent: Wednesday, October 30, 2002 4:02 AM
Subject: Your experiences with Lucene


> Hi,
>  
> I am currently starting work on a project that requires indexing and
> searching on potentially thousands, maybe tens of thousands, of text
> documents.
>  
> I'm hoping that someone has a great success story about using Lucene for
> a project that required indexing and searching of a large number of
> documents.
> Like maybe more than 10,000. I guess what I'm trying to figure out is if
> Lucene's performance will be acceptable where the number of documents is
> very large.
> I realize this is a very general question but I just need a general
> answer.
>  
> Thanks,
>  
> Tim J.
>

Re: LARM web crawler: use lucene itself for visited URLs

Posted by Brian Goetz <br...@quiotix.com>.
>Yes, I thought of that, but it always felt like a weird idea to me.  I
>can't really explain why....  Clemens, what do you think about this?  I
>was imagining something like skipping the link parts that are the same
>in the previous link....and now I know where I got that :)

This seems dangerous to me, since Lucene is free to take liberties with 
tokens, such as stemming and filtering out stop words.  So a URL like
  /path/to/foo
might get mapped to
  /path/foo
if you used a stopword analyzer.
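
A quick way to see this effect (a sketch using the later, 3.x-era analysis
API; the demo class itself is made up):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/** Tokenizes "/path/to/foo" with a stop-word analyzer; "to" disappears. */
public class StopwordUrlDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("url", new StringReader("/path/to/foo"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());   // prints: path, foo
        }
        ts.end();
        ts.close();
    }
}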

A very common trick for compressing paths is this: give each known URL 
prefix a code.  Example:

/foo -> 1 = ("foo")
/foo/bar -> 2 = (1, "bar")
/foo/blah -> 3 = (1, "blah")
/foo/bar/moo -> 4 = (2, "moo")

This trick is used often in caching, to reduce the number of lookups 
required to find an element in a hierarchical cache.
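
A minimal sketch of such a prefix-code table (the names are invented for
illustration, not taken from any real cache implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Gives every URL path a small integer code; each entry is recorded as
 *  (code of its parent prefix, last segment) instead of the full string. */
class PrefixCoder {
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> records = new ArrayList<>();  // code-1 -> "parentCode/segment"

    /** Returns the code for a path like "/foo/bar/moo", creating codes for
     *  the path and any missing ancestors on first sight. */
    int code(String path) {
        Integer known = codes.get(path);
        if (known != null) return known;

        int slash = path.lastIndexOf('/');
        String segment = path.substring(slash + 1);
        int parentCode = (slash <= 0) ? 0 : code(path.substring(0, slash));  // 0 = root

        records.add(parentCode + "/" + segment);  // what would actually be persisted
        int newCode = records.size();             // codes start at 1, as in the example
        codes.put(path, newCode);
        return newCode;
    }
}

With the table above, code("/foo") returns 1, code("/foo/bar") returns 2,
code("/foo/blah") returns 3, and code("/foo/bar/moo") returns 4, and a lookup
never touches more entries than the depth of the path.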



--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com




Rebuilding indexes for new version: future proof index format

Posted by Ype Kingma <yk...@xs4all.nl>.
Dear Lucene'rs,

As the new PrefixQuery will probably make me a lucene CVS user,
I have a question about the changed index format.

I gathered that the newly introduced document boost factor changes the 
scoring enough to make rebuilding the indexes necessary.

In the documentation of the index format it says (at the bottom) that there
are a few places in the current format where the index is not future-proof
because a 32-bit integer is used.
Would the new index format be an opportunity to change these 32-bit integers
to 64 bits?
If so, I'd be happy to try to provide patches.

Kind regards,
Ype Kingma



Re: LARM web crawler: use lucene itself for visited URLs

Posted by Clemens Marschner <cm...@lanlab.de>.
Yeah, but if it is only disk-based it is way too slow.
What is needed is a mixture of both.
We thought about the following:

- Links show a high degree of locality: roughly 90% of the links point back
to the same host. We have to take advantage of that and want to hold the
links of the hosts currently being crawled in RAM if possible.

- The data structure behind that may be something like a red-black tree where
some nodes are in RAM, some on disk, some compressed, some not.
http://citeseer.nj.nec.com/shkapenyuk01design.html gives insights on this.

- A prerequisite for that is that only a limited number of hosts is crawled
at any one time. We want to change the crawler threads so that one thread
loads subsequent URLs only from _one_ host, which also makes it easy to add
the politeness features (see the sketch below).
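
Not LARM's actual design, just a rough sketch of the per-host queueing this
implies; the names and the claiming protocol are made up:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Frontier grouped by host: one crawler thread claims one host's queue at a
 *  time, which keeps URL lookups local and makes politeness delays easy. */
class HostFrontier {
    private final Map<String, Deque<String>> byHost = new HashMap<>();

    synchronized void add(String host, String url) {
        byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
    }

    /** Removes one host's entire queue and hands it to the calling thread;
     *  returns null when nothing is pending. */
    synchronized Deque<String> claimHost() {
        Iterator<Map.Entry<String, Deque<String>>> it = byHost.entrySet().iterator();
        if (!it.hasNext()) return null;
        Deque<String> queue = it.next().getValue();
        it.remove();   // no other thread will crawl this host concurrently
        return queue;
    }
}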

--Clemens

----- Original Message -----
From: "Ype Kingma" <yk...@xs4all.nl>
To: "Lucene Developers List" <lu...@jakarta.apache.org>; "Clemens
Marschner" <cm...@lanlab.de>
Sent: Thursday, October 31, 2002 9:10 AM
Subject: Re: LARM web crawler: use lucene itself for visited URLs


> On Wednesday 30 October 2002 23:30, Clemens Marschner wrote:
> > There's a good paper on compressing URLs in
> > http://citeseer.nj.nec.com/suel01compressing.html It takes advantage of the
> > regular structure of the sorted list of URLs and compresses the resulting
> > structure with some Huffman encoding.
> > I have already implemented a somewhat simpler algorithm that can compress
> > URLs based on their prefixes. I maybe contribute that a little later.
>
> Compressing is one part, storing the visited URL's on disk (to save RAM)
> is another. Once the hashtable being used now grows over a max size,
> it could be added to a lucene db, after which a new indexreader can be opened
> and table can be flushed from RAM.
> No analyzer is needed to create the lucene documents, as the URL's are
> already normalized.
> Lookup can be done on directly with an indexreader, in case
> the lookup in RAM fails.
> The nice thing about it is that this lucene scales up quite a bit.
>
> Have fun,
> Ype
>
>
> > ----- Original Message -----
> > From: "Otis Gospodnetic" <ot...@yahoo.com>
> > To: <lu...@jakarta.apache.org>
> > Sent: Wednesday, October 30, 2002 11:00 PM
> > Subject: Re: LARM web crawler: use lucene itself for visited URLs
> >
> > > Redirecting this to lucene-dev, seems more appropriate.
> > >
> > > Clemens is the person to talk to.
> > > Yes, I thought of that, but it always felt like a weird idea to me.  I
> > > can't really explain why....  Clemens, what do you think about this?  I
> > > was imagining something like skipping the link parts that are the same
> > > in the previous link....and now I know where I got that :)
> > >
> > > Otis
> > >
> > > --- Ype Kingma <yk...@xs4all.nl> wrote:
> > > > I managed to loose some recent messages on the LARM crawler and the
> > > > lucene
> > > > file formats, so I don't know whom to address.
> > > >
> > > > Anyway, I noticed this on the LARM crawler info page
> > >
> > > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > >
> > > > <<<
> > > > Something worth while would be to compress the URLs. A lot of parts
> > > > of URLs
> > > > are the same between hundreds of URLs (i.e. the host name). And since
> > > > only a
> > > > limited number of characters are allowed in URLs, Huffman compression
> > > > will
> > > > lead to a good compression rate.
> > > >
> > > >
> > > > and this on the file formats page
> > > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > > > <<<
> > > > Term text prefixes are shared. The PrefixLength is the number of
> > > > initial
> > > > characters from the previous term which must be pre-pended to a
> > > > term's suffix
> > > > in order to form the term's text. Thus, if the previous term's text
> > > > was
> > > > "bone" and the term is "boy", the PrefixLength is two and the suffix
> > > > is "y".
> > > >
> > > >
> > > > Somehow I get the impression that lucene itself would be quite
> > > > helpful for
> > > > the crawler by using indexed, non stored fields for the normalized
> > > > visited
> > > > URLs.
> > > >
> > > > Have fun,
> > > > Ype
> > > >
> > > > --
> > > > To unsubscribe, e-mail:
> > > > <ma...@jakarta.apache.org>
> > > > For additional commands, e-mail:
> > > > <ma...@jakarta.apache.org>
> > >
> > > __________________________________________________
> > > Do you Yahoo!?
> > > HotJobs - Search new jobs daily now
> > > http://hotjobs.yahoo.com/
> > >
> > > --
> > > To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> > > For additional commands, e-mail: <ma...@jakarta.apache.org>




Re: LARM web crawler: use lucene itself for visited URLs

Posted by Ype Kingma <yk...@xs4all.nl>.
On Wednesday 30 October 2002 23:30, Clemens Marschner wrote:
> There's a good paper on compressing URLs in
> http://citeseer.nj.nec.com/suel01compressing.html It takes advantage of the
> regular structure of the sorted list of URLs and compresses the resulting
> structure with some Huffman encoding.
> I have already implemented a somewhat simpler algorithm that can compress
> URLs based on their prefixes. I maybe contribute that a little later.

Compressing is one part; storing the visited URLs on disk (to save RAM)
is another. Once the hashtable being used now grows beyond a maximum size,
it could be added to a Lucene index, after which a new IndexReader can be opened
and the table can be flushed from RAM.
No analyzer is needed to create the Lucene documents, as the URLs are
already normalized.
Lookups can be done directly with an IndexReader whenever
the lookup in RAM fails.
The nice thing about it is that Lucene scales up quite well for this.
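
A sketch of that scheme, written against a later (3.x-era) Lucene API; the
field flags and reader re-opening worked differently in 2002, and the class
here is only illustrative:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

/** Visited-URL set: an in-RAM hashtable that spills into a Lucene index
 *  (an indexed, non-stored "url" field) once it grows past maxInRam. */
class VisitedUrls {
    private final Set<String> inRam = new HashSet<>();
    private final Directory dir;
    private final IndexWriter writer;
    private final int maxInRam;
    private IndexReader reader;                  // null until the first flush

    VisitedUrls(Directory dir, IndexWriter writer, int maxInRam) {
        this.dir = dir;
        this.writer = writer;
        this.maxInRam = maxInRam;
    }

    boolean contains(String url) throws IOException {
        if (inRam.contains(url)) return true;
        // fall back to the on-disk index when the RAM lookup fails
        return reader != null && reader.docFreq(new Term("url", url)) > 0;
    }

    /** Callers should check contains(url) first to avoid duplicates. */
    void add(String url) throws IOException {
        inRam.add(url);
        if (inRam.size() < maxInRam) return;

        for (String u : inRam) {                 // flush the table into the index
            Document doc = new Document();
            // the URLs are already normalized, so the field is not analyzed:
            doc.add(new Field("url", u, Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.commit();
        if (reader != null) reader.close();
        reader = IndexReader.open(dir);          // new reader sees the flushed URLs
        inRam.clear();
    }
}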

Have fun,
Ype


> ----- Original Message -----
> From: "Otis Gospodnetic" <ot...@yahoo.com>
> To: <lu...@jakarta.apache.org>
> Sent: Wednesday, October 30, 2002 11:00 PM
> Subject: Re: LARM web crawler: use lucene itself for visited URLs
>
> > Redirecting this to lucene-dev, seems more appropriate.
> >
> > Clemens is the person to talk to.
> > Yes, I thought of that, but it always felt like a weird idea to me.  I
> > can't really explain why....  Clemens, what do you think about this?  I
> > was imagining something like skipping the link parts that are the same
> > in the previous link....and now I know where I got that :)
> >
> > Otis
> >
> > --- Ype Kingma <yk...@xs4all.nl> wrote:
> > > I managed to loose some recent messages on the LARM crawler and the
> > > lucene
> > > file formats, so I don't know whom to address.
> > >
> > > Anyway, I noticed this on the LARM crawler info page
> >
> > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> >
> > > <<<
> > > Something worth while would be to compress the URLs. A lot of parts
> > > of URLs
> > > are the same between hundreds of URLs (i.e. the host name). And since
> > > only a
> > > limited number of characters are allowed in URLs, Huffman compression
> > > will
> > > lead to a good compression rate.
> > >
> > >
> > > and this on the file formats page
> > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > > <<<
> > > Term text prefixes are shared. The PrefixLength is the number of
> > > initial
> > > characters from the previous term which must be pre-pended to a
> > > term's suffix
> > > in order to form the term's text. Thus, if the previous term's text
> > > was
> > > "bone" and the term is "boy", the PrefixLength is two and the suffix
> > > is "y".
> > >
> > >
> > > Somehow I get the impression that lucene itself would be quite
> > > helpful for
> > > the crawler by using indexed, non stored fields for the normalized
> > > visited
> > > URLs.
> > >
> > > Have fun,
> > > Ype
> > >
> > > --
> > > To unsubscribe, e-mail:
> > > <ma...@jakarta.apache.org>
> > > For additional commands, e-mail:
> > > <ma...@jakarta.apache.org>
> >
> > __________________________________________________
> > Do you Yahoo!?
> > HotJobs - Search new jobs daily now
> > http://hotjobs.yahoo.com/
> >
> > --
> > To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> > For additional commands, e-mail: <ma...@jakarta.apache.org>



Re: LARM web crawler: use lucene itself for visited URLs

Posted by Clemens Marschner <cm...@lanlab.de>.
There's a good paper on compressing URLs in
http://citeseer.nj.nec.com/suel01compressing.html It takes advantage of the
regular structure of the sorted list of URLs and compresses the resulting
structure with some Huffman encoding.
I have already implemented a somewhat simpler algorithm that can compress
URLs based on their prefixes. I may contribute that a little later.

----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: <lu...@jakarta.apache.org>
Sent: Wednesday, October 30, 2002 11:00 PM
Subject: Re: LARM web crawler: use lucene itself for visited URLs


> Redirecting this to lucene-dev, seems more appropriate.
>
> Clemens is the person to talk to.
> Yes, I thought of that, but it always felt like a weird idea to me.  I
> can't really explain why....  Clemens, what do you think about this?  I
> was imagining something like skipping the link parts that are the same
> in the previous link....and now I know where I got that :)
>
> Otis
>
>
>
> --- Ype Kingma <yk...@xs4all.nl> wrote:
> >
> > I managed to loose some recent messages on the LARM crawler and the
> > lucene
> > file formats, so I don't know whom to address.
> >
> > Anyway, I noticed this on the LARM crawler info page
> >
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > <<<
> > Something worth while would be to compress the URLs. A lot of parts
> > of URLs
> > are the same between hundreds of URLs (i.e. the host name). And since
> > only a
> > limited number of characters are allowed in URLs, Huffman compression
> > will
> > lead to a good compression rate.
> > >>>
> >
> > and this on the file formats page
> > http://jakarta.apache.org/lucene/docs/fileformats.html
> > <<<
> > Term text prefixes are shared. The PrefixLength is the number of
> > initial
> > characters from the previous term which must be pre-pended to a
> > term's suffix
> > in order to form the term's text. Thus, if the previous term's text
> > was
> > "bone" and the term is "boy", the PrefixLength is two and the suffix
> > is "y".
> > >>>
> >
> > Somehow I get the impression that lucene itself would be quite
> > helpful for
> > the crawler by using indexed, non stored fields for the normalized
> > visited
> > URLs.
> >
> > Have fun,
> > Ype
> >
> > --
> > To unsubscribe, e-mail:
> > <ma...@jakarta.apache.org>
> > For additional commands, e-mail:
> > <ma...@jakarta.apache.org>
> >
>
>
> __________________________________________________
> Do you Yahoo!?
> HotJobs - Search new jobs daily now
> http://hotjobs.yahoo.com/
>
> --
> To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
>




Re: LARM web crawler: use lucene itself for visited URLs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Redirecting this to lucene-dev; it seems more appropriate.

Clemens is the person to talk to.
Yes, I thought of that, but it always felt like a weird idea to me.  I
can't really explain why....  Clemens, what do you think about this?  I
was imagining something like skipping the link parts that are the same
as in the previous link....and now I know where I got that :)

Otis



--- Ype Kingma <yk...@xs4all.nl> wrote:
> 
> I managed to loose some recent messages on the LARM crawler and the
> lucene
> file formats, so I don't know whom to address.
> 
> Anyway, I noticed this on the LARM crawler info page
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> <<<
> Something worth while would be to compress the URLs. A lot of parts
> of URLs 
> are the same between hundreds of URLs (i.e. the host name). And since
> only a 
> limited number of characters are allowed in URLs, Huffman compression
> will 
> lead to a good compression rate. 
> >>>
> 
> and this on the file formats page
> http://jakarta.apache.org/lucene/docs/fileformats.html
> <<<
> Term text prefixes are shared. The PrefixLength is the number of
> initial 
> characters from the previous term which must be pre-pended to a
> term's suffix 
> in order to form the term's text. Thus, if the previous term's text
> was 
> "bone" and the term is "boy", the PrefixLength is two and the suffix
> is "y". 
> >>>
> 
> Somehow I get the impression that lucene itself would be quite
> helpful for 
> the crawler by using indexed, non stored fields for the normalized
> visited 
> URLs.
> 
> Have fun,
> Ype
> 
> --
> To unsubscribe, e-mail:  
> <ma...@jakarta.apache.org>
> For additional commands, e-mail:
> <ma...@jakarta.apache.org>
> 


__________________________________________________
Do you Yahoo!?
HotJobs - Search new jobs daily now
http://hotjobs.yahoo.com/



LARM web crawler: use lucene itself for visited URLs

Posted by Ype Kingma <yk...@xs4all.nl>.
I managed to lose some recent messages on the LARM crawler and the lucene
file formats, so I don't know whom to address.

Anyway, I noticed this on the LARM crawler info page
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
<<<
Something worth while would be to compress the URLs. A lot of parts of URLs 
are the same between hundreds of URLs (i.e. the host name). And since only a 
limited number of characters are allowed in URLs, Huffman compression will 
lead to a good compression rate. 
>>>

and this on the file formats page
http://jakarta.apache.org/lucene/docs/fileformats.html
<<<
Term text prefixes are shared. The PrefixLength is the number of initial 
characters from the previous term which must be pre-pended to a term's suffix 
in order to form the term's text. Thus, if the previous term's text was 
"bone" and the term is "boy", the PrefixLength is two and the suffix is "y". 
>>>
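
To make the shared-prefix encoding concrete, a small standalone illustration
(not Lucene's actual term-writing code):

/** A term is written as (PrefixLength, suffix) relative to the previous term. */
public class PrefixLengthDemo {
    static int prefixLength(String previous, String current) {
        int max = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < max && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String previous = "bone", term = "boy";
        int prefixLen = prefixLength(previous, term);   // 2
        String suffix = term.substring(prefixLen);      // "y"
        System.out.println(prefixLen + ", \"" + suffix + "\"");
    }
}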

Somehow I get the impression that lucene itself would be quite helpful for
the crawler by using indexed, non-stored fields for the normalized visited
URLs.

Have fun,
Ype



Re: Your experiences with Lucene

Posted by Scott Ganyo <sc...@etapestry.com>.
Actually, 10k isn't very large.  We have indexes with more than 1M 
records.  It hasn't been a problem.

Scott

Tim Jones wrote:

> Hi,
>
> I am currently starting work on a project that requires indexing and
> searching on potentially thousands, maybe tens of thousands, of text
> documents.
>
> I'm hoping that someone has a great success story about using Lucene for
> a project that required indexing and searching of a large number of
> documents.
> Like maybe more than 10,000. I guess what I'm trying to figure out is if
> Lucene's performance will be acceptable where the number of documents is
> very large.
> I realize this is a very general question but I just need a general
> answer.
>
> Thanks,
>
> Tim J.
>

-- 
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?




Re: Your experiences with Lucene

Posted by Chris Sibert <ch...@attbi.com>.
I thought that you couldn't do date indexing/searching with Lucene. How do
you do it?

----- Original Message -----
From: "Jonathan Pace" <jm...@fedex.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, October 29, 2002 3:08 PM
Subject: RE: Your experiences with Lucene


> Our implementation contains 1.4 million documents for a 1 GB index.  We use
> date sorting and term highlighting with a searcher "pool" created from the
> Jakarta Commons project.  Performance is extremely fast.
>
> Jonathan Pace
> FedEx Services
>
>
> -----Original Message-----
> From: Tim Jones [mailto:timothy.jones@mongoosetech.com]
> Sent: Tuesday, October 29, 2002 2:03 PM
> To: lucene-user@jakarta.apache.org
> Subject: Your experiences with Lucene
>
>
> Hi,
>
> I am currently starting work on a project that requires indexing and
> searching on potentially thousands, maybe tens of thousands, of text
> documents.
>
> I'm hoping that someone has a great success story about using Lucene for
> a project that required indexing and searching of a large number of
> documents.
> Like maybe more than 10,000. I guess what I'm trying to figure out is if
> Lucene's performance will be acceptable where the number of documents is
> very large.
> I realize this is a very general question but I just need a general
> answer.
>
> Thanks,
>
> Tim J.
>
>
> --
> To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
>




RE: Your experiences with Lucene

Posted by Jonathan Pace <jm...@fedex.com>.
Our implementation contains 1.4 million documents for a 1 GB index.  We use
date sorting and term highlighting with a searcher "pool" created from the
Jakarta Commons project.  Performance is extremely fast.

Jonathan Pace
FedEx Services


-----Original Message-----
From: Tim Jones [mailto:timothy.jones@mongoosetech.com]
Sent: Tuesday, October 29, 2002 2:03 PM
To: lucene-user@jakarta.apache.org
Subject: Your experiences with Lucene


Hi,

I am currently starting work on a project that requires indexing and
searching on potentially thousands, maybe tens of thousands, of text
documents.

I'm hoping that someone has a great success story about using Lucene for
a project that required indexing and searching of a large number of
documents.
Like maybe more than 10,000. I guess what I'm trying to figure out is if
Lucene's performance will be acceptable where the number of documents is
very large.
I realize this is a very general question but I just need a general
answer.

Thanks,

Tim J.

