Posted to java-user@lucene.apache.org by TJ Kolev <tj...@gmail.com> on 2010/01/13 21:59:52 UTC

Problem: Indexing and searching repeating groups of fields

Greetings,

Let's assume I have to index and search "resume" documents. Two fields are
defined: Language and Years. The fields are associated together in a group
called Experience. A resume document may have 0 or more Experience groups:

Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
Rb{ E1(Java,2); E2(C,5); E3(VB,1);}

How do I index such documents, and how do I search them, so that I can
formulate a query like "Resumes which have (Java,5) and (C,2)" and get back
only Ra? I know I can index multiple fields of the same name and query
"(Language:Java AND Years:5) AND (Language:C AND Years:2)", but in addition
to Ra that would also return Rb, which I don't want. The problem here is that
the "grouping" is lost. I can create fields with compound names (E1Language,
E1Years, E2Language, E2Years, etc.), but that doesn't help, as I don't know
which group to search. I'd also like to be able to query for
"(Language:Java AND Years:5) OR (Language:C AND Years:2)".
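
To make the lost grouping concrete, here is a minimal plain-Java sketch (not Lucene code). The class FlatResume and its methods are hypothetical stand-ins for a document with two multi-valued fields; it only illustrates why the flattened query also matches Rb.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a document with two multi-valued fields.
final class FlatResume {
    private final Set<String> languages = new HashSet<>();
    private final Set<String> years = new HashSet<>();

    // Mimics adding one Experience group as two independent field values.
    FlatResume add(String language, String yearsValue) {
        languages.add(language);
        years.add(yearsValue);
        return this;
    }

    // (Language:l1 AND Years:y1) AND (Language:l2 AND Years:y2) on the flattened fields.
    boolean matches(String l1, String y1, String l2, String y2) {
        return languages.contains(l1) && years.contains(y1)
            && languages.contains(l2) && years.contains(y2);
    }

    public static void main(String[] args) {
        FlatResume ra = new FlatResume().add("Java", "5").add("C", "2").add("PHP", "3");
        FlatResume rb = new FlatResume().add("Java", "2").add("C", "5").add("VB", "1");
        // Both lines print true: Rb contains Java, 5, C and 2, just not paired that way.
        System.out.println("Ra matches: " + ra.matches("Java", "5", "C", "2"));
        System.out.println("Rb matches: " + rb.matches("Java", "5", "C", "2"));
    }
}
```

Because each field is just a bag of values, nothing records which year belonged to which language.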

This is a simplified example. Real documents may have 30 - 40 groups, each
one with several fields. Putting all the fields of a group into one index
field won't work, as the numeric/date ones should be available for range
searches.

So far, the only way I see is to do my own post-processing on the results.
The issue is that text fields would need to be untokenized; otherwise it
would be difficult to work on the results and determine what matched.

Thank you.
tjk :)

Re: Problem: Indexing and searching repeating groups of fields

Posted by Erick Erickson <er...@gmail.com>.

One approach would be to do this with multi-valued fields. The
idea here is to index all your E fields in the *same* Lucene
field with an increment gap (see getPositionIncrementGap) > 1.

For this example, assume getPositionIncrementGap returns 100.

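Then, for your documents, you have something like the following (the
Store/Index arguments are just one plausible choice for the Lucene 3.x
Field constructor; the field must be analyzed for this to work):
doc.add(new Field("experience", "java,5", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "C,2", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "PHP,3", Field.Store.NO, Field.Index.ANALYZED));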

Then you do proximity searches with a slop of < 100.

The trick is that the above tokens are positioned (roughly) like this:
1 - java
2 - 5
102 - c
103 - 2
203 - php
204 - 3

Of course you have to override a suitable analyzer to break
your tokens up appropriately.

Now a query (SpanNear? Proximity? your choice) of the
form "java 5"~90 AND "c 2"~90 should only return Ra.
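
The position arithmetic can be checked with a small plain-Java model (not Lucene code). Everything here is an assumption made for illustration: the class PositionGapDemo, the comma tokenizer, the lowercasing, and the simplified "within slop positions" test, which only approximates how Lucene scores sloppy phrases. The numbering follows the rough listing above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the token positions listed above.
final class PositionGapDemo {
    static final int GAP = 100; // assumed position increment gap, as in the example

    // Splits each value on commas and records every token's position.
    // The first token of each later value is placed GAP after the previous token.
    static Map<String, List<Integer>> index(String... values) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int last = 0;
        for (int i = 0; i < values.length; i++) {
            int pos = (i == 0) ? 1 : last + GAP;
            for (String token : values[i].toLowerCase(Locale.ROOT).split(",")) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
                last = pos;
                pos++;
            }
        }
        return positions;
    }

    // Simplified proximity test: some occurrences of a and b lie within slop positions.
    static boolean near(Map<String, List<Integer>> positions, String a, String b, int slop) {
        List<Integer> as = positions.getOrDefault(a, Collections.<Integer>emptyList());
        List<Integer> bs = positions.getOrDefault(b, Collections.<Integer>emptyList());
        for (int pa : as) {
            for (int pb : bs) {
                if (Math.abs(pa - pb) <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> ra = index("Java,5", "C,2", "PHP,3");
        Map<String, List<Integer>> rb = index("Java,2", "C,5", "VB,1");
        System.out.println("Ra: " + (near(ra, "java", "5", 90) && near(ra, "c", "2", 90)));
        System.out.println("Rb: " + (near(rb, "java", "5", 90) && near(rb, "c", "2", 90)));
    }
}
```

In Rb, "java" and "5" sit in different values, so they are more than 90 positions apart and the proximity test fails.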

HTH
Erick

Re: Problem: Indexing and searching repeating groups of fields

Posted by TJ Kolev <tj...@gmail.com>.

The issue is that the real-world document has more than two fields per group.
My two-field example was a bit misleading. I can't really "pair" them. Here's
a better example:

Resume_a
  Exp_1
    Language:Java, Years:5, Certification:Sun, Area:Web
  Exp_2
    Language:C, Years:3, Certification:None, Area:Desktop

And yes, I need to be able to query for ranges on the Years field. So for
now, I'll do as much as I can in Lucene, and then filter out any extra
documents from the search results myself.
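
A post-filter along these lines might look like the plain-Java sketch below. It assumes the groups of each candidate resume can be reloaded after the search (for example from stored fields or an external store); PostFilterDemo, Experience and hasAllGroups are hypothetical names, not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class PostFilterDemo {
    // One Experience group; field names follow the example above.
    static final class Experience {
        final String language;
        final int years;
        final String certification;
        final String area;

        Experience(String language, int years, String certification, String area) {
            this.language = language;
            this.years = years;
            this.certification = certification;
            this.area = area;
        }
    }

    // Keeps a candidate only if every wanted condition is met by at least one single group.
    static boolean hasAllGroups(List<Experience> groups, List<Predicate<Experience>> wanted) {
        for (Predicate<Experience> condition : wanted) {
            if (groups.stream().noneMatch(condition)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Experience> resumeA = Arrays.asList(
            new Experience("Java", 5, "Sun", "Web"),
            new Experience("C", 3, "None", "Desktop"));
        // Range on Years plus other fields, all evaluated within one group.
        Predicate<Experience> javaWeb = e ->
            e.language.equals("Java") && e.years >= 4 && e.area.equals("Web");
        Predicate<Experience> cWeb = e ->
            e.language.equals("C") && e.area.equals("Web");
        System.out.println(hasAllGroups(resumeA, Arrays.asList(javaWeb)));       // true
        System.out.println(hasAllGroups(resumeA, Arrays.asList(javaWeb, cWeb))); // false
    }
}
```

Each condition is checked against whole groups, so a value from Exp_1 can never be combined with a value from Exp_2.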

Thank you.
tjk :)

Re: Problem: Indexing and searching repeating groups of fields

Posted by Erick Erickson <er...@gmail.com>.

Well, a variant on the easy solution might. What would
happen if you indexed the un-split pairs in the same
field? I.e.

"java:5", "c:3", "php:2" all indexed as *single* tokens
in the *same* field?
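
As a rough plain-Java illustration of the single-token idea (not Lucene code): each pair becomes one term in one field, and a query is a set-membership test. The class PairTokens is hypothetical, and lowercasing the language is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of one field holding un-split pairs as single tokens.
final class PairTokens {
    private final Set<String> tokens = new HashSet<>();

    // Builds the single token for one pair; lowercasing is an assumption of this sketch.
    static String token(String language, int years) {
        return language.toLowerCase(Locale.ROOT) + ":" + years;
    }

    PairTokens add(String language, int years) {
        tokens.add(token(language, years));
        return this;
    }

    // AND query: every wanted token must be present.
    boolean hasAll(String... wanted) {
        for (String w : wanted) {
            if (!tokens.contains(w)) return false;
        }
        return true;
    }

    // OR query: at least one wanted token must be present.
    boolean hasAny(String... wanted) {
        for (String w : wanted) {
            if (tokens.contains(w)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PairTokens ra = new PairTokens().add("Java", 5).add("C", 2).add("PHP", 3);
        PairTokens rb = new PairTokens().add("Java", 2).add("C", 5).add("VB", 1);
        System.out.println("Ra: " + ra.hasAll(token("java", 5), token("c", 2))); // true
        System.out.println("Rb: " + rb.hasAll(token("java", 5), token("c", 2))); // false
    }
}
```

Exact matches keep the pairing intact, but the years are buried inside the token text, so a range over them is not a simple term lookup.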

But I think you should look at Digy's suggestion again.
6-10 fields is absolutely no problem at all. Documents
in Lucene can have any number of fields without
any relation to any other document. So having
"sparse" documents has no penalty (unlike an RDBMS).

In particular, with Digy's scheme you can answer questions
like "has more than 3 years of C experience" whereas
with my scheme it would be harder.
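
A plain-Java sketch of why the field-per-language layout makes that kind of question easy (not Lucene code). LanguageFields is a hypothetical class, and it assumes at most one years value per language in a resume.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a document whose field names are the languages.
final class LanguageFields {
    private final Map<String, Integer> yearsByLanguage = new HashMap<>();

    LanguageFields put(String language, int years) {
        yearsByLanguage.put(language, years);
        return this;
    }

    // Exact match, like the query Java:5.
    boolean has(String language, int years) {
        Integer value = yearsByLanguage.get(language);
        return value != null && value == years;
    }

    // Range match, like "has more than N years of C experience".
    boolean moreThan(String language, int years) {
        Integer value = yearsByLanguage.get(language);
        return value != null && value > years;
    }

    public static void main(String[] args) {
        LanguageFields ra = new LanguageFields().put("Java", 5).put("C", 2).put("PHP", 3);
        LanguageFields rb = new LanguageFields().put("Java", 2).put("C", 5).put("VB", 1);
        System.out.println("Ra Java:5 AND C:2: " + (ra.has("Java", 5) && ra.has("C", 2)));
        System.out.println("Rb Java:5 AND C:2: " + (rb.has("Java", 5) && rb.has("C", 2)));
        System.out.println("Rb more than 3 years of C: " + rb.moreThan("C", 3));
    }
}
```

The years stay numeric per language, so both exact and range conditions apply to the right pair.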


FWIW
Erick

Re: Problem: Indexing and searching repeating groups of fields

Posted by TJ Kolev <tj...@gmail.com>.

Found public int getPositionIncrementGap(String fieldName) on Analyzer.
Sweet! Should've read more before emailing.

tjk :)

Re: Problem: Indexing and searching repeating groups of fields

Posted by TJ Kolev <tj...@gmail.com>.

Hi!

I don't think the easy solution will work for me, because I'll have more
than two fields in a group - perhaps 6 - 10.

However using span queries looks very promising. I'll investigate that.

I see setPositionIncrement() only on the Token object. Is there a way to set
this when adding a field to the document, so that the first token gets its
position pushed away? I would prefer not to modify my analyzer if possible.

Thank you.
tjk :)

Re: Problem: Indexing and searching repeating groups of fields

Posted by Erick Erickson <er...@gmail.com>.

Ooooh, isn't that easier. You just prompted me to think
that you don't even have to do that, just index the pairs as single
tokens (KeywordAnalyzer? but watch out for no case folding)...

RE: Problem: Indexing and searching repeating groups of fields

Posted by Digy <di...@gmail.com>.

How about using languages as fieldnames?
Doc1(Ra):
	Java:5
	C:2
	PHP:3
	
Doc2(Rb)
	Java:2
	C:5
	VB:1

Query:Java:5 AND C:2

DIGY

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org