Posted to java-user@lucene.apache.org by Yaniv Ben Yosef <ya...@gmail.com> on 2010/01/07 21:54:22 UTC

Implementing filtering based on multiple fields

Hi,

I'm very new to Lucene. In fact, I'm at the beginning of an evaluation
phase, trying to figure out whether Lucene is the right fit for my needs.
The project I'm involved in requires something similar to the Google Custom
Search Engine <http://www.google.com/cse/> (CSE). In CSE, each user can
define a set (could be a large set) of websites, and limit the search to
only those websites. So for example, I can create a CSE that searches all
web pages on cnn.com, msnbc.com and nytimes.com only.
I am trying to understand whether and how I can do something similar in
Lucene.

The FAQ hints at this possibility here
<http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_search_over_multiple_fields.3F>,
but it mentions a class that no longer exists in 3.0 (QueryFilter), and is
very laconic about the suggested options. Also, I'm not sure how well it will
perform in my use case (or even whether it fits at all).
I thought about creating a separate index for each user or CSE. However, my
system should be able to handle tens of thousands of concurrent users. I
haven't done any analysis yet on how this will affect CPU, RAM, I/O and
storage size, but was wondering if any of you experienced Lucene
users/developers think it's a good direction.
If that's not a good idea, what would be a good strategy here?

Any help will be much appreciated,
Yaniv

Re: Implementing filtering based on multiple fields

Posted by Lucifer Hammer <lu...@gmail.com>.
Why not just add custom terms onto the end of each query for each user?
E.g., when user X queries for "bananas" and has previously set their
domains to cnn and yahoo, why not rewrite the search query as
"fullText:bananas AND (domain:cnn OR domain:yahoo)"?

Off the top of my head, there are a few caveats:

1) if the domain list is large, you'll have to deal with the max boolean
clauses setting (BooleanQuery's maxClauseCount)
2) parsing the query can be slow; however, there's a tradeoff between
managing thousands of indexes and a slight performance hit (or you can put
the query together without parsing, depending on how you handle the user's
query terms)
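
A minimal sketch of the query-rewriting idea above, in plain Java with no
Lucene dependency. The class name, method name, and the 1024 limit constant
are assumptions made for illustration (1024 is, as far as I know, the default
max clause count for BooleanQuery). The field names fullText and domain come
from the example query above. The user's text is wrapped in parentheses so
that multi-word input stays under the fullText field; special characters are
not escaped here.

```java
import java.util.List;
import java.util.stream.Collectors;

class DomainRestrictedQuery {
    // Assumed limit, mirroring what I believe is BooleanQuery's default
    // max clause count; adjust to whatever the index is configured with.
    static final int ASSUMED_MAX_CLAUSES = 1024;

    // Builds a query string that limits the user's query to the given domains.
    // Hypothetical helper: it does NOT escape query-syntax characters in its
    // inputs (QueryParser.escape may be an option for that).
    static String restrict(String userQuery, List<String> domains) {
        if (domains.isEmpty()) {
            throw new IllegalArgumentException("at least one domain is required");
        }
        if (domains.size() > ASSUMED_MAX_CLAUSES) {
            throw new IllegalArgumentException(
                "too many domains for a single boolean query: " + domains.size());
        }
        String domainClause = domains.stream()
            .map(d -> "domain:" + d)
            .collect(Collectors.joining(" OR "));
        return "fullText:(" + userQuery + ") AND (" + domainClause + ")";
    }

    public static void main(String[] args) {
        // prints: fullText:(bananas) AND (domain:cnn OR domain:yahoo)
        System.out.println(restrict("bananas", List.of("cnn", "yahoo")));
    }
}
```

The resulting string would still go through the query parser, so caveat 2
applies; building the query objects directly would avoid the parse step.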

This seems like too simple an approach, I'm sure I'm not understanding
something...

LH

Re: Implementing filtering based on multiple fields

Posted by Yaniv Ben Yosef <ya...@gmail.com>.
Thanks Otis, that's very helpful.


Re: Implementing filtering based on multiple fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ah, well, masking it didn't help.  Yes, ignore Bixo, Nutch, and Droids then.
Consider DataImportHandler from Solr or wait a bit for Lucene Connectors Framework to materialize.  Or use LuSql, or DbSight, or Sematext's Database Indexer.

Yes, I was suggesting a separate index for each user.  That's what Simpy uses and has some 200K indices on 1 box.... and I think dozens of QPS without any caching, if I remember correctly.  Load is under 1.0.
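
To make the index-per-user idea concrete, here is a small sketch of how a
user ID might be mapped to its own index directory. This is an assumption for
illustration only, not a description of how Simpy lays out its indices: the
class name, the "user-" prefix, and the 256-bucket scheme (used so that no
single directory has to hold every user's index) are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class UserIndexLocator {
    // Hypothetical layout: <root>/<two-hex-digit bucket>/user-<id>
    // Bucketing by (userId % 256) spreads user directories over 256 folders.
    static Path indexDirFor(Path root, long userId) {
        if (userId < 0) {
            throw new IllegalArgumentException("userId must be non-negative");
        }
        String bucket = String.format("%02x", userId % 256);
        return root.resolve(bucket).resolve("user-" + userId);
    }

    public static void main(String[] args) {
        // 4711 % 256 = 103 = 0x67, so this resolves to indexes/67/user-4711
        // (with the platform's own path separator).
        System.out.println(indexDirFor(Paths.get("indexes"), 4711));
    }
}
```

Each resolved directory would then hold one small Lucene index, opened only
when that user actually searches.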

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Implementing filtering based on multiple fields

Posted by Yaniv Ben Yosef <ya...@gmail.com>.
Thanks Otis.

If I understand correctly - Bixo, Nutch and Droids are technologies to use
for crawling the web and building an index. My project is actually about
indexing a large database, where you can think of every row as a web page,
and a particular column is the equivalent of a web site. (I didn't mention
that in the previous post because I didn't want to complicate my question,
and it seems equivalent to Google CSE given that Lucene can use virtually
any input for indexing, AFAIK)
Therefore I'm not sure if the frameworks you've mentioned are applicable to
my project as they seem to be related to web page indexing, but perhaps I'm
missing something.
Also, what did you mean by isolating users and their data/indices? Did
you mean that I should create a separate index per user?

Thanks again!


Re: Implementing filtering based on multiple fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
For something like CSE, I think you want to isolate users and their data/indices.

I'd look at Bixo or Nutch or Droids ==> Lucene or Solr

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch




