Posted to java-user@lucene.apache.org by howard chen <ho...@gmail.com> on 2006/12/16 11:29:40 UTC

[Interesting Question] How to implement Indexes Grouping?

Consider the following interesting situation,

A library has around 100K books that it wants indexed with Lucene. This
seems straightforward, but....

The goals are:

0. You can search all books in the whole library [easy, just index them].

1. Users of this system can keep a number of books on a personal
bookshelf, and they may want to search the books on their bookshelf
ONLY.

2. If each user owns a copy of the index covering their personal
bookshelf, that seems to waste storage space, because books are shared
by many users.

3. If the whole index is searched no matter which books a user owns,
that seems to waste computation when a user owns only a few books.


Given this situation, how would you design the indexing + search
system?

Any ideas to share?

:)

Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: [Interesting Question] How to implement Indexes Grouping?

Posted by howard chen <ho...@gmail.com>.

Thanks for your help, really useful!



Re: [Interesting Question] How to implement Indexes Grouping?

Posted by Erick Erickson <er...@gmail.com>.
You can't tell until you get some numbers. So try it. I'm indexing 4,600
books in about 45 minutes on a laptop as part of my current project. So it
shouldn't be much of a problem to index, say, 10,000 books as a starter set.
This will give you some idea of the size of your index(es), and some idea of
the performance. You're almost required to do this since nobody can answer
performance questions in the abstract. It depends.... how much are you
indexing? What is your index structure? etc. etc. etc.

Be aware that Lucene indexes only the first 10,000 tokens of a field by
default (IndexWriter's maxFieldLength). You can make this as large as
you want, but you have to do it consciously.
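
The practical effect of that cap is easy to overlook: the tail of a long
book simply never reaches the index. A Lucene-free illustration (the cap
of 3 and the sample tokens below are made up purely for demonstration):

```java
import java.util.Arrays;
import java.util.List;

public class FieldTruncation {
    // Mirrors the idea behind IndexWriter's maxFieldLength: only the
    // first maxTokens tokens of a field are indexed; the rest are dropped.
    static List<String> indexedTokens(List<String> tokens, int maxTokens) {
        return tokens.subList(0, Math.min(tokens.size(), maxTokens));
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("call", "me", "ishmael", "some", "years", "ago");
        // With a cap of 3, only the first three tokens would be searchable.
        System.out.println(indexedTokens(body, 3)); // [call, me, ishmael]
    }
}
```

Searches for any term past the cap would return nothing, which is why
raising the limit has to be a conscious decision for full-book indexing.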

It should take you less than a day to create a test harness that fires off N
threads at your searcher to measure load. I can't emphasize enough how
valuable this will be as you design your system.
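
A minimal harness in that spirit might look like the sketch below. The
"searcher" here is a stand-in Runnable (a 5 ms sleep) for a real
IndexSearcher call, and the thread and query counts are invented; plug
in your own searcher to get real numbers:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadHarness {
    // Fires `queries` tasks at `searcher` from a pool of `threads` workers
    // and returns the measured throughput in queries per second.
    static double measure(Runnable searcher, int threads, int queries) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        long start = System.nanoTime();
        for (int i = 0; i < queries; i++) {
            pool.execute(() -> { searcher.run(); done.incrementAndGet(); });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
                throw new IllegalStateException("harness timed out");
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        if (done.get() != queries) throw new IllegalStateException("lost queries");
        return queries / seconds;
    }

    public static void main(String[] args) {
        // Stand-in "searcher": pretend each query costs ~5 ms of work.
        Runnable fakeSearch = () -> {
            try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        System.out.printf("throughput: %.0f queries/sec%n", measure(fakeSearch, 8, 200));
    }
}
```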

Changing from a single index to a distributed one isn't difficult; see
MultiSearcher. Partitioning the index is something you'll have to do
anyway, and I'd build it on multiple instances of a simple indexer, so
starting with the simple, single-index case doesn't waste any time.
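
Partitioning can be as simple as hashing each book id onto one of N
sub-indexes; the same function then tells you at search time which shard
holds a given book, and MultiSearcher can fan queries across all of
them. A sketch of the assignment step (N = 4 and the call-number-style
ids are arbitrary examples):

```java
public class Partitioner {
    // Deterministically assigns a book to one of `partitions` sub-indexes.
    static int partitionOf(String bookId, int partitions) {
        // Math.floorMod guards against negative hashCode values.
        return Math.floorMod(bookId.hashCode(), partitions);
    }

    public static void main(String[] args) {
        String[] books = {"QA76.9.D3", "PS3515.E37", "QA76.76.J38"};
        for (String id : books) {
            System.out.println(id + " -> index shard " + partitionOf(id, 4));
        }
    }
}
```

Because the assignment is deterministic, re-indexing a single book only
ever touches one shard.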

You need to answer some questions for yourself... stored or unstored text?
How many other fields do you want to store and/or index? What is acceptable
performance?

My point is that you can get quite a ways with a very simple design, without
doing much in the way of throw-away work. And the answers you get from the
simple case will give you actual data to make further decisions. Otherwise,
you risk making a complex solution that you don't need. Do you have any
basis at all for estimating that 20 subgroups is sufficient and necessary?

Your goal here is to get the answer for your final design as quickly as
possible. At the same time, you want to waste as little time writing code
that you'll discard later. So try the simple case on a test data set. This
will get your index design into a firmer state and you can load-test it with
your presumed load and get actual data for your system. Until you do this,
any answer you have is just a guess.

Best
Erick

On 12/16/06, howard chen <ho...@gmail.com> wrote:
>
> I agree that a filter is one way to implement it. My concern is that
> with such a big index (say, 100K books, full-text indexed) it will
> become the bottleneck, and it is difficult to distribute the indexing
> and searching.
>
> My initial thinking is to group the index by call number: say, divide
> the 100K books into 20 subgroups and, when a user searches, spawn 20
> threads to search the subgroups on different servers.

Re: [Interesting Question] How to implement Indexes Grouping?

Posted by howard chen <ho...@gmail.com>.
On 12/16/06, Erick Erickson <er...@gmail.com> wrote:
> I'd start with just one big index and test <G>. My point is that you can't
> speculate. The first question you have to answer is "is searching the whole
> index fast enough given my architecture?" and we can't answer that. Nor can
> you until you try.......
>
> We especially can't speculate since you've provided no clue how many users
> you're talking about. 10? 1,000,000? How many books do you expect them to
> own? 10? 100,000? I can't imagine separate indexes for 1M users each owning
> all 1000 books. I can imagine it for 10 users owning 100 books.....
>
> Assuming that you get decent performance in a single index, I'd create a
> filter at query time for a user. The filter has the bits turned on for the
> books the user owns and include the filter as part of a BooleanQuery when I
> searched the text. The filters could even be permanently stored rather than
> created each time, but I'd save that refinement for later.....
>
> Note that if you do store a filter, they are quite small. 1 bit per book (+
> very small overhead)....
>
> Best
> Erick

I agree that a filter is one way to implement it. My concern is that
with such a big index (say, 100K books, full-text indexed) it will
become the bottleneck, and it is difficult to distribute the indexing
and searching.

My initial thinking is to group the index by call number: say, divide
the 100K books into 20 subgroups and, when a user searches, spawn 20
threads to search the subgroups on different servers.
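
That scatter-gather idea can be sketched without any Lucene classes.
Everything here is illustrative: the Hit type, the fake per-shard search
that stands in for a remote call, and the shard count. Each subgroup is
searched in parallel and the per-shard hits are merged by score:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

public class ScatterGather {
    static class Hit {
        final String bookId;
        final float score;
        Hit(String bookId, float score) { this.bookId = bookId; this.score = score; }
        public String toString() { return bookId + ":" + score; }
    }

    // Searches `shards` subgroups in parallel and merges the per-shard
    // hits, highest score first, keeping only the top `k`.
    static List<Hit> search(int shards, IntFunction<List<Hit>> shardSearch, int k) {
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (int s = 0; s < shards; s++) {
                final int shard = s;
                futures.add(pool.submit(() -> shardSearch.apply(shard)));
            }
            List<Hit> all = new ArrayList<>();
            for (Future<List<Hit>> f : futures) all.addAll(f.get());
            all.sort((a, b) -> Float.compare(b.score, a.score));
            return all.subList(0, Math.min(k, all.size()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake per-shard search: each subgroup returns one hit with a made-up score.
        List<Hit> top = search(20,
                shard -> Collections.singletonList(new Hit("book-" + shard, shard * 0.1f)), 3);
        System.out.println(top);
    }
}
```

The relevance caveat: merged scores are only comparable if the shards
score consistently, which is part of what makes distributed search
harder than it first appears.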



Re: [Interesting Question] How to implement Indexes Grouping?

Posted by Erick Erickson <er...@gmail.com>.
I'd start with just one big index and test <G>. My point is that you can't
speculate. The first question you have to answer is "is searching the whole
index fast enough given my architecture?" and we can't answer that. Nor can
you until you try.......

We especially can't speculate since you've provided no clue how many users
you're talking about. 10? 1,000,000? How many books do you expect them to
own? 10? 100,000? I can't imagine separate indexes for 1M users each owning
all 1000 books. I can imagine it for 10 users owning 100 books.....

Assuming that you get decent performance in a single index, I'd create
a filter at query time for each user. The filter has the bits turned on
for the books the user owns, and I'd include it as part of a
BooleanQuery when searching the text. The filters could even be stored
permanently rather than created each time, but I'd save that refinement
for later.....

Note that if you do store a filter, they are quite small: 1 bit per
book (+ very small overhead)....
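
The arithmetic checks out: 100K books at 1 bit per book is roughly
100,000 / 8 = 12,500 bytes per stored user filter. The filtering itself
is just an AND between "documents matching the query" and "documents
the user owns"; a Lucene-free sketch using plain java.util.BitSet (the
document numbers below are invented):

```java
import java.util.BitSet;

public class OwnershipFilter {
    // Restricts query matches to the user's bookshelf: bit i is set in
    // the result only if document i both matched the query and is owned.
    static BitSet filter(BitSet queryMatches, BitSet ownedBooks) {
        BitSet result = (BitSet) queryMatches.clone();  // don't mutate the input
        result.and(ownedBooks);
        return result;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();   // docs 2, 5, 9 match the query
        matches.set(2); matches.set(5); matches.set(9);
        BitSet owned = new BitSet();     // the user owns docs 5 and 7
        owned.set(5); owned.set(7);
        System.out.println(filter(matches, owned)); // {5}
    }
}
```

In real Lucene the ownership BitSet would come from a Filter built by
walking the term docs for the user's book ids, but the intersection step
is exactly this cheap.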

Best
Erick


Re: [Interesting Question] How to implement Indexes Grouping?

Posted by Erick Erickson <er...@gmail.com>.
This one I can pretty much guarantee will take too long. I tried
something similar with a single book and found it unacceptably slow. I
didn't think much about the possibility of extracting it from an
already-existing index, though, and I'd be surprised if it would work
quickly enough....

Although if there is a way to do something like this I'd be really, really
interested.....

Erick

On 12/16/06, Phil Rosen <pr...@optaros.com> wrote:
>
> I wonder if storing only the ids of a user's bookshelf, and then at
> runtime, when they perform a search, creating a temporary RAMDirectory
> index of only the books on the user's bookshelf, could satisfy both
> points 2 & 3.
>
> Speaking of RAMDirectory, is there a simple way to convert hits to an
> index without iterating?

RE: [Interesting Question] How to implement Indexes Grouping?

Posted by Phil Rosen <pr...@optaros.com>.
I wonder if storing only the ids of a user's bookshelf, and then at
runtime, when they perform a search, creating a temporary RAMDirectory
index of only the books on the user's bookshelf, could satisfy both
points 2 & 3.

Speaking of RAMDirectory, is there a simple way to convert hits to an
index without iterating?
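
One way to picture the cost of this approach: the temporary index is a
small inverted index rebuilt from the owned books' text on each session.
A toy Lucene-free version (book ids, texts, and the naive tokenizer are
all invented) shows both the appeal and the price, since the build loop
below is the work a RAMDirectory approach repeats per user:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BookshelfIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Builds an in-memory inverted index over only the owned books.
    // `texts` maps bookId -> full text; in practice this build step is
    // what makes per-search index construction expensive.
    BookshelfIndex(Map<String, String> texts, Set<String> ownedIds) {
        for (String id : ownedIds) {
            String text = texts.get(id);
            if (text == null) continue;
            for (String token : text.toLowerCase().split("\\W+")) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, String> library = new HashMap<>();
        library.put("b1", "whaling voyages and the sea");
        library.put("b2", "gardening in small spaces");
        BookshelfIndex idx = new BookshelfIndex(library, new HashSet<>(Arrays.asList("b1")));
        System.out.println(idx.search("sea"));   // [b1]
        System.out.println(idx.search("small")); // [] because b2 is not on the shelf
    }
}
```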



