Posted to java-user@lucene.apache.org by Suba Suresh <su...@wolfram.com> on 2006/08/23 17:36:00 UTC
How to combine multiple fields to a single field for indexing
The "Lucene in Action" book says it is better practice to combine two
fields into one field and index that than to use MultiFieldQueryParser.
Do I initially index both fields and then index them again together?
When I index them together, do I index the field names or the values? Can
someone give me an example of how to do it?
thanks,
suba suresh.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: How to combine multiple fields to a single field for indexing
Posted by Chris Hostetter <ho...@fucit.org>.
: How do you set the position increment gap between each addition to the
It doesn't have an explicit setter; you just subclass the Analyzer of your
choosing and override getPositionIncrementGap to return the value of your
choosing -- it could be a fixed value, or your Analyzer could be
sophisticated and know to put in a larger gap after it sees special marker
values/tokens (ie: a gap of 10 after each "sentence", a gap of 100 after
each "paragraph", a gap of 100 after each "page", ...)
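A minimal sketch of the subclass-and-override pattern described above. To keep it runnable without Lucene on the classpath, BaseAnalyzer is a hypothetical stand-in for whichever Analyzer you would subclass; only the method name getPositionIncrementGap comes from the message. The field name "contents" and the gap of 100 are illustrative assumptions.

```java
// Hypothetical stand-in for the Analyzer being subclassed; not a Lucene class.
abstract class BaseAnalyzer {
    // Default behaviour in this sketch: no extra gap between values of a same-named field.
    int getPositionIncrementGap(String fieldName) {
        return 0;
    }
}

// Subclass that overrides the gap for one field (field name and value are assumptions).
class GappedAnalyzer extends BaseAnalyzer {
    private final int contentsGap;

    GappedAnalyzer(int contentsGap) {
        this.contentsGap = contentsGap;
    }

    @Override
    int getPositionIncrementGap(String fieldName) {
        // Only the aggregate "contents" field gets a gap here; other fields keep the default.
        return "contents".equals(fieldName)
                ? contentsGap
                : super.getPositionIncrementGap(fieldName);
    }
}

class GapOverrideDemo {
    public static void main(String[] args) {
        BaseAnalyzer analyzer = new GappedAnalyzer(100);
        System.out.println(analyzer.getPositionIncrementGap("contents")); // 100
        System.out.println(analyzer.getPositionIncrementGap("name"));     // 0
    }
}
```

With real Lucene you would extend the Analyzer you already use and pass the subclass to your indexing code; the marker-sensitive variant mentioned above would need extra state that this sketch does not attempt.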
: same field name. Should you set it as high as possible to prevent
: proximity queries from crossing it? I have been looking for the code to
...
: nearspan, things blow up if you look for something within
: Integer.maximum--sic :) -- Will this be the same case for setting the
: positional gap and if so is there a good max to use to keep a query from
: ever crossing it?
How big a gap you should use depends entirely on how you want to use it
-- you could say that a gap of "10" is big enough if you know your
application will never ask for phrase/span queries with slop greater than
"10" ... or you could pick 100, or 1000 ... it's entirely up to you; the
question is, do you ever *want* your clients to be able to "bridge the
gap"? If so, then they need to know how big the gap is; if not, then they
need to be prevented from asking for slop bigger than the gap.
-Hoss
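One way to enforce the "prevent clients from asking for too much slop" option is to cap the slop before building the query. This is only a sketch: SlopLimiter and clampSlop are hypothetical names, and capping at gap - 1 is a deliberately conservative assumption so that a phrase can never reach across the gap.

```java
// Hypothetical helper; not part of Lucene.
class SlopLimiter {
    // Returns a slop that stays strictly below the configured gap.
    static int clampSlop(int requestedSlop, int positionIncrementGap) {
        if (requestedSlop < 0) {
            throw new IllegalArgumentException("slop must be >= 0");
        }
        if (positionIncrementGap <= 0) {
            return requestedSlop; // no gap configured, so there is nothing to protect
        }
        return Math.min(requestedSlop, positionIncrementGap - 1);
    }

    public static void main(String[] args) {
        System.out.println(clampSlop(5, 100));   // 5: already below the gap
        System.out.println(clampSlop(500, 100)); // 99: capped just under the gap
    }
}
```

The clamped value would then be used wherever the application sets the slop on its phrase or span queries.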
Re: How to combine multiple fields to a single field for indexing
Posted by Mark Miller <ma...@gmail.com>.
How do you set the position increment gap between each addition to the
same field name? Should you set it as high as possible to prevent
proximity queries from crossing it? I have been looking for the code to
find out how to put a gap between each same name field addition, but I
have been unable to find what I am looking for. Also, when using a
nearspan, things blow up if you look for something within
Integer.maximum--sic :) -- Will this be the same case for setting the
positional gap and if so is there a good max to use to keep a query from
ever crossing it?
Thanks,
Mark
Erik Hatcher wrote:
>
> On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
>> In "Lucene In Action" book it says it is better practice to combine
>> two fields into one field and index it than use the
>> MultiFieldQueryParser. Do I initially index both the fields and then
>> index them again together? When I index them together do I index the
>> fieldnames or values? Can someone give me an example of how to do it?
>
> What I do is simply index all the fields individually that need to be
> searchable or just stored, but also index a general-purpose "contents"
> field with all of that same text.
>
> You can add multiple fields of the same name to a document, making it
> easy to just keep appending to a "contents" field for a document. You
> can see how this is done in the Lucene in Action code in the
> TestDataDocumentHandler.java - however I took a cruder approach and
> appended the fields together with a space in between them rather than
> using the multiple valued field approach. Either technique will work
> just fine.
>
> Erik
Re: How to combine multiple fields to a single field for indexing
Posted by KEGan <kh...@gmail.com>.
Thanks. I think I grasp the concept now :)
On 8/27/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> > Erik,
> >
> > "Given the position increment gap between instances of same-named
> > fields that is now part of Lucene, I recommend using multiple field
> > instances instead."
> >
> > Did you mean ... recommend "NOT" using multiple field ?
>
> I said what I meant accurately. Comparing building a single
> aggregate search field either by concatenating text into a single
> string and a single field, say "contents" instance, versus multiple
> "contents" instances that could get separated by a position increment
> gap, I recommend the second approach.
>
> But...
>
> > If we want to do a query like "name:John" or boosting of fields ...
> > then we
> > have to use multiple field instances, right ?
>
> of course.
>
> Erik
>
Re: How to combine multiple fields to a single field for indexing
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> Erik,
>
> "Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instances instead."
>
> Did you mean ... recommend "NOT" using multiple field ?
I said what I meant accurately. Comparing building a single
aggregate search field either by concatenating text into a single
string and a single field, say "contents" instance, versus multiple
"contents" instances that could get separated by a position increment
gap, I recommend the second approach.
But...
> If we want to do a query like "name:John" or boosting of fields ...
> then we
> have to use multiple field instances, right ?
of course.
Erik
Re: How to combine multiple fields to a single field for indexing
Posted by KEGan <kh...@gmail.com>.
Erik,
"Given the position increment gap between instances of same-named
fields that is now part of Lucene, I recommend using multiple field
instances instead."
Did you mean ... recommend "NOT" using multiple field ?
If we want to do a query like "name:John" or boosting of fields ... then we
have to use multiple field instances, right?
On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> Yeah, I used a cruder form by appending all the text together into a
> single string with a space separator in that LIA example.
>
> Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instances instead.
>
> Erik
Re: How to combine multiple fields to a single field for indexing
Posted by Suba Suresh <su...@wolfram.com>.
Thanks for everyone's help. I understand how it works now. I can get rid
of MultiFieldQueryParser in search.
thanks
suba suresh.
Erik Hatcher wrote:
> Yeah, I used a cruder form by appending all the text together into a
> single string with a space separator in that LIA example.
>
> Given the position increment gap between instances of same-named fields
> that is now part of Lucene, I recommend using multiple field instances
> instead.
>
> Erik
Re: How to combine multiple fields to a single field for indexing
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Yeah, I used a cruder form by appending all the text together into a
single string with a space separator in that LIA example.
Given the position increment gap between instances of same-named
fields that is now part of Lucene, I recommend using multiple field
instances instead.
Erik
On Aug 24, 2006, at 3:05 AM, Gopikrishnan Subramani wrote:
> Erik has used a space as the field separator. Maybe you can use a
> different field separator that your analyzer won't eat up, so that
> will
> change the token position by 1.
>
> Gopi
Re: How to combine multiple fields to a single field for indexing
Posted by KEGan <kh...@gmail.com>.
I think I'm starting to understand this :) ... Thanks, guys.
~KEGan
On 8/24/06, Gopikrishnan Subramani <go...@gmail.com> wrote:
>
> Erik has used a space as the field separator. Maybe you can use a
> different field separator that your analyzer won't eat up, so that will
> change the token position by 1.
>
> Gopi
Re: How to combine multiple fields to a single field for indexing
Posted by Gopikrishnan Subramani <go...@gmail.com>.
Erik has used a space as the field separator. Maybe you can use a
different field separator that your analyzer won't eat up, so that will
change the token position by 1.
Gopi
On 8/24/06, KEGan <kh...@gmail.com> wrote:
>
> Erik,
>
> What is generally the reason for indexing both individual fields, and the
> general-purpose "content" field ?
>
> Also, if we search in the general-purpose "content" field, wouldn't this
> problem occur? Let's say we have 2 fields with the following values:
>
> name : John Smith
> food : subway sandwich
>
> So the general-purpose "content" field would have the following value:
>
> John Smith subway sandwich
>
> Hence, if the user searches for "smith subway" (with quotation marks), the said
> document will be returned. On the other hand, if both fields were indexed
> separately, this document would not be returned, since there is no field
> that contains the value "smith subway".
>
> How do we go about this problem ?
Re: How to combine multiple fields to a single field for indexing
Posted by Chris Hostetter <ho...@fucit.org>.
: What is generally the reason for indexing both individual fields, and the
: general-purpose "content" field ?
so you can explicitly query for "name:paris" or "city:paris" instead of
just "paris"
: name : John Smith
: food : subway sandwich
:
: So the general-purpose "content" would have the following values:
:
: John Smith subway sandwich
:
: Hence, if the user search for "smith subway" (with quotation), the said
Not exactly ... this is where the position increment gap of your Analyzer
comes in. You can say how much of a gap exists between two separate values in
the same field, so if your gap is 10 then contents:"smith subway"~5 won't
match ... but contents:(smith subway) will.
-Hoss
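The effect of the gap on the John Smith / subway sandwich example can be sketched with a small plain-Java model. This is not Lucene code: GapPositionModel and its methods are hypothetical, tokens are simply lowercased and split on whitespace, the gap is assumed to be added on top of the normal increment of 1 between values, and only an in-order two-term phrase is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of token positions in a multi-valued field; not a Lucene class.
class GapPositionModel {
    // Positions at which `term` occurs across all values of the field.
    static List<Integer> positionsOf(String term, List<String> values, int gap) {
        List<Integer> result = new ArrayList<>();
        int next = 0; // position the next token will receive
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                next += gap; // extra distance inserted between two values
            }
            for (String token : values.get(v).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                if (token.equals(term)) result.add(next);
                next++;
            }
        }
        return result;
    }

    // In-order two-term phrase: matches when at most `slop` positions lie between the terms.
    static boolean phraseMatches(String first, String second, int slop,
                                 List<String> values, int gap) {
        for (int p1 : positionsOf(first, values, gap)) {
            for (int p2 : positionsOf(second, values, gap)) {
                int between = p2 - p1 - 1;
                if (between >= 0 && between <= slop) return true;
            }
        }
        return false;
    }

    // Plain boolean-style check: every term occurs somewhere in the field.
    static boolean allTermsPresent(List<String> terms, List<String> values, int gap) {
        for (String term : terms) {
            if (positionsOf(term, values, gap).isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> contents = Arrays.asList("John Smith", "subway sandwich");
        System.out.println(phraseMatches("smith", "subway", 5, contents, 10)); // false
        System.out.println(phraseMatches("smith", "subway", 0, contents, 0));  // true
        System.out.println(allTermsPresent(Arrays.asList("smith", "subway"), contents, 10)); // true
    }
}
```

In this model "smith" sits at position 1 and "subway" at position 12 when the gap is 10, so a slop of 5 cannot connect them, while a query that only requires both terms still matches.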
Re: How to combine multiple fields to a single field for indexing
Posted by KEGan <kh...@gmail.com>.
Erik,
What is generally the reason for indexing both the individual fields and the
general-purpose "content" field?
Also, if we search in the general-purpose "content" field, wouldn't this
problem occur? Let's say we have 2 fields with the following values:
name : John Smith
food : subway sandwich
So the general-purpose "content" field would have the following value:
John Smith subway sandwich
Hence, if the user searches for "smith subway" (with quotation marks), the said
document will be returned. On the other hand, if both fields were indexed
separately, this document would not be returned, since there is no field
that contains the value "smith subway".
How do we go about this problem?
On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> > In "Lucene In Action" book it says it is better practice to combine
> > two fields into one field and index it than use the
> > MultiFieldQueryParser. Do I initially index both the fields and
> > then index them again together? When I index them together do I
> > index the fieldnames or values? Can someone give me an example of
> > how to do it?
>
> What I do is simply index all the fields individually that need to be
> searchable or just stored, but also index a general-purpose
> "contents" field with all of that same text.
>
> You can add multiple fields of the same name to a document, making it
> easy to just keep appending to a "contents" field for a document.
> You can see how this is done in the Lucene in Action code in the
> TestDataDocumentHandler.java - however I took a cruder approach and
> appended the fields together with a space in between them rather than
> using the multiple valued field approach. Either technique will work
> just fine.
>
> Erik
Re: How to combine multiple fields to a single field for indexing
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> In "Lucene In Action" book it says it is better practice to combine
> two fields into one field and index it than use the
> MultiFieldQueryParser. Do I initially index both the fields and
> then index them again together? When I index them together do I
> index the fieldnames or values? Can someone give me an example of
> how to do it?
What I do is simply index all the fields individually that need to be
searchable or just stored, but also index a general-purpose
"contents" field with all of that same text.
You can add multiple fields of the same name to a document, making it
easy to just keep appending to a "contents" field for a document.
You can see how this is done in the Lucene in Action code in the
TestDataDocumentHandler.java - however I took a cruder approach and
appended the fields together with a space in between them rather than
using the multiple valued field approach. Either technique will work
just fine.
Erik
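The two techniques described above can be illustrated with a small in-memory model. This is a sketch only: SimpleDoc is a hypothetical stand-in for a document with multi-valued fields, not the Lucene Document class, and the field names are assumptions. It shows that the field values (not the field names) go into "contents", either as separate instances or joined with a space.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a document with multi-valued fields; not a Lucene class.
class SimpleDoc {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Adds the value under its own field name and again as another "contents" instance.
    void addSearchable(String name, String value) {
        add(name, value);
        add("contents", value);
    }

    private void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All values stored under a field name, in insertion order.
    List<String> values(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }

    // The cruder alternative: one string with a space between the values.
    String concatenatedContents() {
        return String.join(" ", values("contents"));
    }

    public static void main(String[] args) {
        SimpleDoc doc = new SimpleDoc();
        doc.addSearchable("name", "John Smith");
        doc.addSearchable("food", "subway sandwich");
        System.out.println(doc.values("contents"));     // [John Smith, subway sandwich]
        System.out.println(doc.concatenatedContents()); // John Smith subway sandwich
    }
}
```

A search against "contents" then replaces the multi-field query, while "name" and "food" remain available for field-specific queries.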