Posted to java-user@lucene.apache.org by Josh Rehman <jo...@joshrehman.com> on 2011/08/17 05:03:24 UTC

Can I use Lucene to solve this problem?

My organization is looking to solve a difficult problem, and I believe that
Lucene is a close fit (although perhaps it is not). However, I'm not sure
exactly how to approach it.

The problem is this: given a small set of fixed noun phrases and a much
larger set of human generated short sentences, determine whether the
sentences refer to those noun phrases. For example, perhaps I have these
noun phrases:

   1. Bright yellow book
   2. Large bulbous balloon
   3. Green plaid shirt with stripes
   4. Dark yellow book

And these sentences:

   1. Yesterday I put on my green plaid shirt.
   2. Next week I'll sell my balloon.
   3. Just finished my bright book.
   4. Wondering at how lovely my baloon is [Note the misspelling]

Given that list of sentences, I will generate (sentence, noun phrase)
ordered pairs like this:
1,3
2,2
3,1
4,2

Or even an ordered pair of (sentence, [noun phrases]). E.g. 3,[1,4] (because
there might be an ambiguous reference to "Book")
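
As a baseline for the pairing described above, here is a minimal sketch in
plain Java (it does not use Lucene). It pairs a sentence with the noun
phrases that share the most words with it. The class name PhraseOverlap, the
stop-word list, and the "highest overlap wins" rule are assumptions made for
illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: pairs a sentence with the noun
// phrases that share the most words with it.
class PhraseOverlap {

    // Assumed, minimal stop list; a real one would be larger.
    static final Set<String> STOP = Set.of("with", "my", "i", "on", "at", "is", "how");

    // Lowercase the text, split on non-letters, and drop stop words.
    static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns the 1-based indexes of the phrases with the highest non-zero
    // word overlap. Ties are all returned, so an ambiguous sentence yields
    // several phrases; no overlap yields an empty list.
    static List<Integer> bestPhrases(String sentence, List<String> phrases) {
        Set<String> sentenceTokens = tokens(sentence);
        List<Integer> best = new ArrayList<>();
        int bestScore = 0;
        for (int i = 0; i < phrases.size(); i++) {
            Set<String> shared = tokens(phrases.get(i));
            shared.retainAll(sentenceTokens);
            int score = shared.size();
            if (score == 0) {
                continue;
            }
            if (score > bestScore) {
                best.clear();
                bestScore = score;
            }
            if (score == bestScore) {
                best.add(i + 1);
            }
        }
        return best;
    }
}
```

Exact word overlap finds nothing for the misspelled "baloon", so sentence 4
would need some form of fuzzy matching on top of this.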

The "shape" of this problem looks a lot like what Lucene does, but frankly I
don't have a lot of experience with textual indexing and search. I've
installed Lucene and managed to index and search my data structures; however,
with the StandardAnalyzer I'm getting a lot of false positives.

Here is the code I have so far (I've elided the parsing code which is not
very interesting):
  https://gist.github.com/1150723

Really appreciate any and all guidance. Thanks.

Re: Can I use Lucene to solve this problem?

Posted by Federico Fissore <fe...@fissore.org>.
Josh Rehman wrote on 17/08/2011 05:03:
> My organization is looking to solve a difficult problem, and I believe that
> Lucene is a close fit (although perhaps it is not). However I'm not sure
> exactly how to approach this problem.
>
[...]


maybe using semantic vectors? [0]

we've played around with it for a while but never had the time to put it in
production: basically you search the vector index for each of your
sentences and get back a set of vectors (the noun phrases). the hard
part imho is finding a threshold (if one exists) to say, e.g., that
(1,1) is too distant while (1,3) is close enough
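
To make the threshold idea concrete, here is a rough sketch in plain Java. It
uses simple word-count vectors and cosine similarity as a stand-in for real
semantic vectors; it does not use the semanticvectors package or its API. The
class name CosineThreshold and any particular threshold value are assumptions
for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: word-count vectors plus a similarity cutoff.
class CosineThreshold {

    // Builds a word -> count vector from lowercased text.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                vector.merge(t, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int c : vector.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity in [0, 1]; 0 when the texts share no words.
    static double cosine(String a, String b) {
        Map<String, Integer> va = counts(a);
        Map<String, Integer> vb = counts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(va);
        double nb = norm(vb);
        return (na == 0 || nb == 0) ? 0.0 : dot / (na * nb);
    }

    // The cutoff itself is the part that has to be tuned on real data.
    static boolean closeEnough(String sentence, String phrase, double threshold) {
        return cosine(sentence, phrase) >= threshold;
    }
}
```

With an assumed cutoff of 0.4, sentence 1 would count as close to noun phrase
3 and not to noun phrase 1, but a cutoff that works across a whole data set
would have to be found empirically.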

fede

[0] https://code.google.com/p/semanticvectors/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Can I use Lucene to solve this problem?

Posted by Alexander Aristov <al...@gmail.com>.
Hi

Look at the Apache Mahout project (based on Hadoop). It seems you need
machine learning algorithms.

Best Regards
Alexander Aristov


On 17 August 2011 20:39, Ian Lea <ia...@gmail.com> wrote:

> Certainly sounds doable in lucene.  Is it basically working apart from
> false positives?  Can you give some examples of the false positives?
>
> I'd be tempted to look at span queries which will let you say that
> "Yesterday I put on my green plaid shirt" is a better match against
> "Green plaid shirt with stripes" than "a plaid shirt that is green"
> would.  If that is what you want. See
> http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ for
> good info on span queries.
>
> As for misspellings, that is a separate issue.  Google lucene
> spellcheck.  Or look at synonyms if you've got a list of alternatives.
>
>
> --
> Ian.

Re: Can I use Lucene to solve this problem?

Posted by Ian Lea <ia...@gmail.com>.
Certainly sounds doable in Lucene.  Is it basically working apart from
the false positives?  Can you give some examples of the false positives?

I'd be tempted to look at span queries, which let you say that
"Yesterday I put on my green plaid shirt" is a better match against
"Green plaid shirt with stripes" than "a plaid shirt that is green"
would be, if that is what you want. See
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ for
good info on span queries.
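
For readers new to the idea, here is a plain-Java sketch of the kind of
ordered, proximity-limited matching that Lucene's SpanNearQuery offers when
it is built with inOrder set to true. It does not call Lucene, and its
handling of slop is only an approximation. The class name OrderedNear and its
methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of in-order matching with a slop limit.
class OrderedNear {

    // Lowercased words of the text, in order.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Smallest number of extra words inside a span that contains all terms
    // in the given order, or -1 if the terms never occur in that order.
    static int minGap(List<String> terms, List<String> text) {
        if (terms.isEmpty()) {
            return -1;
        }
        int best = -1;
        for (int start = 0; start < text.size(); start++) {
            if (!text.get(start).equals(terms.get(0))) {
                continue;
            }
            int pos = start;
            boolean found = true;
            for (int t = 1; t < terms.size() && found; t++) {
                int next = -1;
                for (int j = pos + 1; j < text.size(); j++) {
                    if (text.get(j).equals(terms.get(t))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    pos = next;
                }
            }
            if (found) {
                int gap = (pos - start + 1) - terms.size();
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        return best;
    }

    // True when the terms occur in order with at most `slop` extra words.
    static boolean matchesInOrder(String terms, String sentence, int slop) {
        int gap = minGap(words(terms), words(sentence));
        return gap >= 0 && gap <= slop;
    }
}
```

Under these assumptions, "green plaid shirt" matches "Yesterday I put on my
green plaid shirt" with no slop at all, while "a plaid shirt that is green"
never matches because the words appear in a different order.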

As for misspellings, that is a separate issue.  Search for "Lucene
spellcheck", or look at synonyms if you've got a list of alternatives.
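
As a rough picture of what fuzzy matching involves, here is a plain-Java edit
distance (Levenshtein) sketch. It is not Lucene's spellchecker or any Lucene
API; the class name EditDistance and the idea of allowing a fixed number of
edits are assumptions for illustration.

```java
import java.util.Locale;

// Hypothetical illustration: treat two words as the same if they are within
// a small number of single-character edits of each other.
class EditDistance {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(cur[j - 1] + 1, prev[j] + 1);
                cur[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive comparison that tolerates up to maxEdits edits.
    static boolean fuzzyEquals(String a, String b, int maxEdits) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        return levenshtein(x, y) <= maxEdits;
    }
}
```

With one edit allowed, "baloon" would be treated as "balloon", while "book"
would not.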


--
Ian.


