You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Mark Ferguson <ma...@gmail.com> on 2010/03/01 21:35:24 UTC

Reverse Search

Hello,

I am trying to figure out the best search strategy for my situation and am
looking for advice. I will be processing short bits of text (Tweets for
example), and need to search them to see if they certain terms. The list of
terms is a set of locations (towns cities) and is quite long, approximately
500 different entries, and terms can contain spaces.

My typical approach would be to index each Tweet and then search the
resulting document index for each search term. However, I'm not sure this is
the best solution in this situation for two reasons: first, the list of
locations is quite long so we are talking about a large number of queries,
which may grow even larger so I see scalability issues. Second, my Tweet
index is not stable as I am just interested in each Tweet as it comes in,
and can discard it after, so I have no need really to index each entry. It
is actually my list of locations which is stable and searchable.

My thought is to do some kind of reverse search, in which I index the
locations, and then I pass each Tweet to that index as my query. I am not
exactly sure how to go about this though in a way that will do the search in
the way I want. I am also concerned about locations that contain spaces and
how to have these recognised.

As an example, if my locations list is as follows: {"New York", "Chicago",
"Los Angeles"} and my text is the following: "Fire burning in Los Angeles",
I would like to be able to send that _text_ as  query to my indexed location
list, and get a hit.

Is this something that is doable, or does someone envision a different
approach to the problem? Thanks for your time.

Mark Ferguson

RE: Reverse Search

Posted by Digy <di...@gmail.com>.
MoreLikeThis in contrib may help.

DIGY

-----Original Message-----
From: Mark Ferguson [mailto:mark.a.ferguson@gmail.com] 
Sent: Monday, March 01, 2010 10:35 PM
To: java-user@lucene.apache.org
Subject: Reverse Search

Hello,

I am trying to figure out the best search strategy for my situation and am
looking for advice. I will be processing short bits of text (Tweets for
example), and need to search them to see if they certain terms. The list of
terms is a set of locations (towns cities) and is quite long, approximately
500 different entries, and terms can contain spaces.

My typical approach would be to index each Tweet and then search the
resulting document index for each search term. However, I'm not sure this is
the best solution in this situation for two reasons: first, the list of
locations is quite long so we are talking about a large number of queries,
which may grow even larger so I see scalability issues. Second, my Tweet
index is not stable as I am just interested in each Tweet as it comes in,
and can discard it after, so I have no need really to index each entry. It
is actually my list of locations which is stable and searchable.

My thought is to do some kind of reverse search, in which I index the
locations, and then I pass each Tweet to that index as my query. I am not
exactly sure how to go about this though in a way that will do the search in
the way I want. I am also concerned about locations that contain spaces and
how to have these recognised.

As an example, if my locations list is as follows: {"New York", "Chicago",
"Los Angeles"} and my text is the following: "Fire burning in Los Angeles",
I would like to be able to send that _text_ as  query to my indexed location
list, and get a hit.

Is this something that is doable, or does someone envision a different
approach to the problem? Thanks for your time.

Mark Ferguson


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Reverse Search

Posted by Mark Ferguson <ma...@gmail.com>.
Hi Steve,

MemoryIndex appears to be exactly what I'm looking for. Thank you!

Mark



On Mon, Mar 1, 2010 at 2:01 PM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi Mark,
>
> On 03/01/2010 at 3:35 PM, Mark Ferguson wrote:
> > I will be processing short bits of text (Tweets for example), and
> > need to search them to see if they certain terms.
>
> You might consider, instead of performing reverse search, just querying all
> of your locations against one document at a time using Lucene's MemoryIndex,
> which is very fast:
>
> <
> http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html
> >
>
> If you decide to go the reverse search route, Lucene's InstantiatedIndex is
> also very fast, and unlike MemoryIndex, can handle more than one document at
> a time:|
>
> <
> http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/store/instantiated/package-summary.html
> >
>
> Steve
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Reverse Search

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mark,

On 03/01/2010 at 3:35 PM, Mark Ferguson wrote:
> I will be processing short bits of text (Tweets for example), and
> need to search them to see if they certain terms.

You might consider, instead of performing reverse search, just querying all of your locations against one document at a time using Lucene's MemoryIndex, which is very fast: 

<http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html>

If you decide to go the reverse search route, Lucene's InstantiatedIndex is also very fast, and unlike MemoryIndex, can handle more than one document at a time:|

<http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/store/instantiated/package-summary.html>

Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org