
Posted to java-user@lucene.apache.org by Geebee Coder <g....@gmail.com> on 2016/06/15 13:25:24 UTC

Using Lucene to model ownership of documents

Hi there,
I would like to use Lucene to solve the following problem:

1. We have about 100k customers and 25 million documents.

2. When a customer performs a text search on the document space, we want to
return only documents that the customer has access to.

3. The number of documents a customer owns varies a lot: some own close to 23
million, some close to 10k, some a third of the documents, and so on.

What is an efficient way to use Lucene in this scenario in terms of
performance and indexing?
We have tried a number of solutions, such as:

a) 100k boolean fields per document, each indicating whether a customer has
access to the document.
b) A single text field that lists the customers who own the document,
e.g. (customers field: "abc abd cfx...")
c) The above option with shards by customer.
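
To make option b) concrete, here is a minimal sketch in plain Java. It is a toy in-memory inverted index, not Lucene's API, and every class and method name in it is made up for illustration. Each document carries a whitespace-separated owners value, and a search intersects the postings of the text term with the postings of the customer ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model of option b): one multi-valued "customers" field per document.
// This is NOT Lucene code; it only mimics how term postings are intersected.
class OwnershipIndexSketch {
    // body term -> ids of documents whose text contains that term
    private final Map<String, TreeSet<Integer>> bodyPostings = new HashMap<>();
    // customer id -> ids of documents that customer may see
    private final Map<String, TreeSet<Integer>> ownerPostings = new HashMap<>();

    void add(int docId, String body, String owners) {
        for (String term : body.toLowerCase().split("\\s+")) {
            bodyPostings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        for (String owner : owners.split("\\s+")) {
            ownerPostings.computeIfAbsent(owner, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns ids of documents that match the term AND are visible to the customer.
    List<Integer> search(String term, String customerId) {
        TreeSet<Integer> hits =
                new TreeSet<>(bodyPostings.getOrDefault(term.toLowerCase(), new TreeSet<>()));
        hits.retainAll(ownerPostings.getOrDefault(customerId, new TreeSet<>()));
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        OwnershipIndexSketch index = new OwnershipIndexSketch();
        index.add(1, "quarterly report", "abc abd");
        index.add(2, "annual report", "abd cfx");
        index.add(3, "meeting notes", "abc");
        System.out.println(index.search("report", "abc")); // [1]
        System.out.println(index.search("report", "abd")); // [1, 2]
    }
}
```

In a real index the ownership clause would be a filter on the customers field combined with the text query, so it restricts the result set without affecting scoring.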

Search and indexing performance for a) was bad. b) and c) performed better for
search but increased indexing time and index size.
We are also thinking about using a custom filter, but we are concerned about
its memory requirements.
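
For a rough sense of the memory concern, here is a back-of-the-envelope sketch. It assumes the custom filter is cached as one uncompressed bitset per customer, with one bit per document. That is only one possible design, and the class name is hypothetical; sparse or compressed sets would need far less for customers who own few documents.

```java
// Back-of-the-envelope estimate: one uncompressed bitset per customer,
// one bit per document (an assumption, not a measured figure).
class FilterMemoryEstimate {
    // Bytes needed for a bitset covering docCount documents, rounded up.
    static long bitsetBytes(long docCount) {
        return (docCount + 7) / 8;
    }

    // Bytes needed if every customer's bitset is held in memory at once.
    static long totalBytes(long docCount, long customerCount) {
        return bitsetBytes(docCount) * customerCount;
    }

    public static void main(String[] args) {
        long docs = 25_000_000L;     // from the post: 25 million documents
        long customers = 100_000L;   // from the post: about 100k customers
        System.out.println(bitsetBytes(docs) + " bytes per customer");          // 3125000
        System.out.println(totalBytes(docs, customers) + " bytes if all cached"); // 312500000000
    }
}
```

That is roughly 3 MB per customer, which is harmless for a handful of cached filters but on the order of 300 GB if every customer's bitset were kept resident.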

Any ideas/suggestions would be really appreciated.

Re: Using Lucene to model ownership of documents

Posted by Geebee Coder <g....@gmail.com>.

Thanks Denis. My mistake. For a and b, indexing speed, index size and search
performance were similar.

I agree on the simplicity comment.
For anyone who might come across this, here's our best solution so far
(for Elasticsearch):


For every customer, use Elasticsearch's nested fields,
e.g. ownership of a document by customers aab and aac is represented as
a: ab, ac
This solution compacts the index but increases indexing time
somewhat. Search performance is similar to having one "ownership" field with
all the customers concatenated.
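
The encoding can be sketched in plain Java as follows. This only shows how the customer IDs might be regrouped before indexing. It assumes the first character of each ID is the grouping key, which is inferred from the single "a: ab, ac" example, and the class and method names are hypothetical. It does not show the Elasticsearch mapping itself.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Regroups customer IDs by their first character, e.g. aab, aac -> a: [ab, ac].
// The one-character prefix is an assumption based on the example in the post.
class OwnershipPrefixGrouping {
    static Map<String, List<String>> groupByFirstChar(Collection<String> customerIds) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String id : customerIds) {
            if (id.isEmpty()) {
                continue; // nothing to group for an empty ID
            }
            String prefix = id.substring(0, 1);
            String rest = id.substring(1);
            grouped.computeIfAbsent(prefix, k -> new ArrayList<>()).add(rest);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(groupByFirstChar(List.of("aab", "aac"))); // {a=[ab, ac]}
    }
}
```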




On Thu, Jun 16, 2016 at 9:27 PM, Denis Bazhenov <do...@gmail.com> wrote:

> The speed for a and b should be the same, at least from a conceptual point
> of view. The number of terms generated for each scenario is equal.
> Therefore, index size and vocabulary size should be the same.
>
> I’m wondering why there is a difference. It seems like there is some penalty
> for writing/reading terms for different fields, but I can’t elaborate on
> that. Could you provide the index size for scenarios a and b?
>
> Scenario c could be the fastest in terms of search and indexing speed, but
> it’s far more complex and makes sense only if you need to scale your
> system, which implies you can’t solve the problem on a single box.
>
> So, if there is no need for scaling, I’d go with b because of simplicity.

Re: Using Lucene to model ownership of documents

Posted by Denis Bazhenov <do...@gmail.com>.

The speed for a and b should be the same, at least from a conceptual point of view. The number of terms generated for each scenario is equal. Therefore, index size and vocabulary size should be the same.
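
As a rough illustration of that argument, the sketch below counts term/document pairs (postings) for both layouts on a tiny made-up ownership table. It assumes scenario a indexes a flag only for customers who do have access. If a "false" value were also indexed for every other customer, scenario a would generate far more postings. The field names and the class are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Counts postings (field:term@doc) produced by scenarios a and b.
class PostingCountSketch {
    // Scenario a: one boolean field per customer, only "true" values indexed.
    static Set<String> scenarioA(Map<Integer, List<String>> ownersByDoc) {
        Set<String> postings = new HashSet<>();
        for (Map.Entry<Integer, List<String>> e : ownersByDoc.entrySet()) {
            for (String owner : e.getValue()) {
                postings.add("access_" + owner + ":true@" + e.getKey());
            }
        }
        return postings;
    }

    // Scenario b: a single "customers" field with one term per owner.
    static Set<String> scenarioB(Map<Integer, List<String>> ownersByDoc) {
        Set<String> postings = new HashSet<>();
        for (Map.Entry<Integer, List<String>> e : ownersByDoc.entrySet()) {
            for (String owner : e.getValue()) {
                postings.add("customers:" + owner + "@" + e.getKey());
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> ownersByDoc = Map.of(
                1, List.of("abc", "abd"),
                2, List.of("abd", "cfx"),
                3, List.of("abc"));
        System.out.println(scenarioA(ownersByDoc).size()); // 5
        System.out.println(scenarioB(ownersByDoc).size()); // 5
    }
}
```

Both layouts yield one posting per (customer, document) ownership pair. What differs is the number of distinct fields, which is where a per-field overhead could plausibly come from.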

I’m wondering why there is a difference. It seems like there is some penalty for writing/reading terms for different fields, but I can’t elaborate on that. Could you provide the index size for scenarios a and b?

Scenario c could be the fastest in terms of search and indexing speed, but it’s far more complex and makes sense only if you need to scale your system, which implies you can’t solve the problem on a single box.

So, if there is no need for scaling, I’d go with b because of simplicity.


---
Denis Bazhenov <do...@gmail.com>

Re: Using Lucene to model ownership of documents

Posted by Geebee Coder <g....@gmail.com>.

Thank you all.
Michael, do you mean grouping customers by categories? (e.g. customer A has
premium access and so does customer B, so they will have access to the same
set of documents)
If that's the case, unfortunately we don't have such categories of
customers; their access rights are over specific documents, not tiers.


On Thu, Jun 16, 2016 at 9:37 AM, Michael Wilkowski <mw...@silenteight.com>
wrote:

> Definitely b). I would also suggest groups and expanding user groups at
> user sign in time.
>
> MW

Re: Using Lucene to model ownership of documents

Posted by Michael Wilkowski <mw...@silenteight.com>.

Definitely b). I would also suggest groups and expanding user groups at
user sign in time.
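
One way to read this suggestion, sketched in plain Java with hypothetical names: documents are tagged with group IDs rather than individual customer IDs, and a user's group memberships are resolved once at sign-in into the set of terms that every later search is filtered by. The membership table is made up for illustration, and the approach only applies where such groups exist.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Resolves a user's groups once at sign-in; searches then filter on those groups.
class GroupExpansionSketch {
    private final Map<String, Set<String>> groupsByUser = new HashMap<>();

    void addMembership(String user, String group) {
        groupsByUser.computeIfAbsent(user, k -> new TreeSet<>()).add(group);
    }

    // Called once per session; the result would be kept alongside the session.
    Set<String> expandAtSignIn(String user) {
        return new TreeSet<>(groupsByUser.getOrDefault(user, Collections.<String>emptySet()));
    }

    // A document is visible if it is tagged with at least one of the user's groups.
    static boolean visible(Set<String> sessionGroups, Set<String> docGroups) {
        return !Collections.disjoint(sessionGroups, docGroups);
    }

    public static void main(String[] args) {
        GroupExpansionSketch acl = new GroupExpansionSketch();
        acl.addMembership("alice", "premium");
        Set<String> session = acl.expandAtSignIn("alice");
        System.out.println(visible(session, Set.of("premium", "staff"))); // true
        System.out.println(visible(session, Set.of("staff")));            // false
    }
}
```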

MW

On Thu, Jun 16, 2016 at 12:36 PM, Ian Lea <ia...@gmail.com> wrote:

> I'd definitely go for b).  The index will of course be larger for every
> extra bit of data you store but it doesn't sound like this would make much
> difference.  Likewise for speed of indexing.
>
>
> --
> Ian.

Re: Using Lucene to model ownership of documents

Posted by Ian Lea <ia...@gmail.com>.

I'd definitely go for b).  The index will of course be larger for every
extra bit of data you store but it doesn't sound like this would make much
difference.  Likewise for speed of indexing.


--
Ian.

