Posted to java-user@lucene.apache.org by Suba Suresh <su...@wolfram.com> on 2006/08/23 17:36:00 UTC

How to combine multiple fields to a single field for indexing

The "Lucene in Action" book says it is better practice to combine two 
fields into one field and index that than to use the MultiFieldQueryParser. 
Do I index both fields individually first and then index them again together? 
When I index them together, do I index the field names or the values? Can 
someone give me an example of how to do it?

thanks,
suba suresh.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to combine multiple fields to a single field for indexing

Posted by Chris Hostetter <ho...@fucit.org>.

: How do you set the position increment gap between each addition to the

it doesn't have an explicit setter, you just subclass the Analyzer of your
choosing and override getPositionIncrementGap to return the value of your
choosing -- it could be a fixed value, or your Analyzer could be
sophisticated and know to put in a larger gap after it sees special marker
values/tokens (e.g., a gap of 10 after each "sentence", a gap of 100 after
each "paragraph", a gap of 100 after each "page", ...)
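
To make the subclass-and-override idea concrete, here is a minimal sketch in
plain Java. AnalyzerModel, WideGapAnalyzerModel and the positions() helper are
made-up stand-ins, not Lucene classes; only the method name
getPositionIncrementGap is taken from the discussion above. The gap of 100,
the "contents" field name and the whitespace tokenizing are assumptions for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for illustration only; this is NOT Lucene's Analyzer class.
class AnalyzerModel {
    // Gap added between two values of the same field name (default: none).
    int getPositionIncrementGap(String fieldName) {
        return 0;
    }

    // Assigns one position per whitespace-separated token across all values
    // added under the same field name, applying the gap after each value.
    List<Integer> positions(String fieldName, List<String> values) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (String value : values) {
            for (String token : value.split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                position++;
                out.add(position);
            }
            position += getPositionIncrementGap(fieldName);
        }
        return out;
    }
}

// The "subclass and override" step: a wider gap for one field only.
class WideGapAnalyzerModel extends AnalyzerModel {
    @Override
    int getPositionIncrementGap(String fieldName) {
        return "contents".equals(fieldName) ? 100 : 0;
    }
}

class GapDemo {
    public static void main(String[] args) {
        List<String> values = Arrays.asList("John Smith", "subway sandwich");
        // Default gap of 0: the two values run together.
        System.out.println(new AnalyzerModel().positions("contents", values));
        // Overridden gap of 100: the second value starts far away.
        System.out.println(new WideGapAnalyzerModel().positions("contents", values));
    }
}
```

In this model the first run prints [0, 1, 2, 3] and the second prints
[0, 1, 102, 103]. In real code the override would live in your own Analyzer
subclass, so check the exact signature against the Lucene version you run.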

: same field name. Should you set it as high as possible to prevent
: proximity queries from crossing it? I have been looking for the code to
	...
: nearspan, things blow up if you look for something within
: Integer.maximum--sic :) -- Will this be the same case for setting the
: positional gap and if so is there a good max to use to keep a query from
: ever crossing it?

How big of a gap you should use depends entirely on how you want to use it
-- you could say that a gap of "10" is big enough if you know your
application will never ask for phrase/span queries with slop greater than
"10" ... or you could pick 100, or 1000 .. it's entirely up to you; the
question is do you ever *want* your clients to be able to "bridge the
gap"?  if so, then they need to know how big the gap is, if not then they
need to be prevented from asking for slop bigger than the gap.
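
One way to do the "prevent" part is to cap the slop before the query is built.
A minimal sketch, assuming a hypothetical SlopPolicy helper (not part of
Lucene) and a simplified position model in which the first token of the next
value lands at last position + gap + 1. Under that model, two terms that sit
on either side of the boundary need a slop of at least the gap, so the sketch
caps at gap - 1. The exact boundary is an assumption worth verifying against
your Lucene version.

```java
// Hypothetical helper for illustration; not a Lucene class.
class SlopPolicy {
    // Returns the slop a client may actually use, so that a phrase query
    // cannot reach across the gap between two values of the same field.
    static int clampSlop(int requestedSlop, int gap) {
        if (requestedSlop < 0) {
            throw new IllegalArgumentException("slop must not be negative");
        }
        if (gap <= 0) {
            // No gap configured: there is no boundary to protect.
            return requestedSlop;
        }
        return Math.min(requestedSlop, gap - 1);
    }

    public static void main(String[] args) {
        System.out.println(clampSlop(5, 100));   // small slop passes through
        System.out.println(clampSlop(250, 100)); // oversized slop is capped
    }
}
```

If you do want clients to bridge the gap, skip the clamp and publish the gap
size instead, so they can ask for a slop at least that large.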




-Hoss



Re: How to combine multiple fields to a single field for indexing

Posted by Mark Miller <ma...@gmail.com>.

How do you set the position increment gap between each addition to the 
same field name? Should you set it as high as possible to prevent 
proximity queries from crossing it? I have been looking for the code to 
find out how to put a gap between each same name field addition, but I 
have been unable to find what I am looking for. Also, when using a 
nearspan, things blow up if you look for something within 
Integer.maximum--sic :) -- Will this be the same case for setting the 
positional gap and if so is there a good max to use to keep a query from 
ever crossing it?

Thanks,

Mark

Erik Hatcher wrote:
>
> On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
>> In "Lucene In Action" book it says it is better practice to combine 
>> two fields into one field and index it than use the 
>> MultiFieldQueryParser. Do I initially index both the fields and then 
>> index them again together? When I index them together do I index the 
>> fieldnames or values? Can someone give me an example of how to do it?
>
> What I do is simply index all the fields individually that need to be 
> searchable or just stored, but also index a general-purpose "contents" 
> field with all of that same text.
>
> You can add multiple fields of the same name to a document, making it 
> easy to just keep appending to a "contents" field for a document.  You 
> can see how this is done in the Lucene in Action code in the 
> TestDataDocumentHandler.java - however I took a cruder approach and 
> appended the fields together with a space in between them rather than 
> using the multiple valued field approach.  Either technique will work 
> just fine.
>
>     Erik
>
>


Re: How to combine multiple fields to a single field for indexing

Posted by KEGan <kh...@gmail.com>.

Thanks. I think I grasp the concept now :)

On 8/27/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> > Erik,
> >
> > "Given the position increment gap between instances of same-named
> > fields that is now part of Lucene, I recommend using multiple field
> > instances instead."
> >
> > Did you mean ... recommend "NOT" using multiple field ?
>
> I said what I meant accurately.  Comparing building a single
> aggregate search field either by concatenating text into a single
> string and a single field, say "contents" instance, versus multiple
> "contents" instances that could get separated by a position increment
> gap, I recommend the second approach.
>
> But...
>
> > If we want to do query like "name:John" or boasting of Fields ...
> > then we
> > have to use multiple field instances, right ?
>
> of course.
>
>        Erik
>
>
>

Re: How to combine multiple fields to a single field for indexing

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> Erik,
>
> "Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instances instead."
>
> Did you mean ... recommend "NOT" using multiple field ?

I said what I meant accurately.  Comparing building a single  
aggregate search field either by concatenating text into a single  
string and a single field, say "contents" instance, versus multiple  
"contents" instances that could get separated by a position increment  
gap, I recommend the second approach.

But...

> If we want to do query like "name:John" or boasting of Fields ...  
> then we
> have to use multiple field instances, right ?

of course.

	Erik


Re: How to combine multiple fields to a single field for indexing

Posted by KEGan <kh...@gmail.com>.

Erik,

"Given the position increment gap between instances of same-named
fields that is now part of Lucene, I recommend using multiple field
instances instead."

Did you mean ... recommend "NOT" using multiple fields?

If we want to do queries like "name:John" or boosting of fields ... then we
have to use multiple field instances, right?


On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> Yeah, I used a cruder form by appending all the text together into a
> single string with a space separator in that LIA example.
>
> Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instances instead.
>
>        Erik
>
>
>
> On Aug 24, 2006, at 3:05 AM, Gopikrishnan Subramani wrote:
> > Erik's has used a space as the field separator. May be you can use a
> > different field separator that your analyzer won't eat up, so that
> > will
> > change the token position by 1.
> >
> > Gopi
> >
> > On 8/24/06, KEGan <kh...@gmail.com> wrote:
> >>
> >> Erik,
> >>
> >> What is generally the reason for indexing both individual fields,
> >> and the
> >> general-purpose "content" field ?
> >>
> >> Also, if we search in the general-purpose "content" field, wouldnt
> >> this
> >> problem occurs. Let say we have 2 fields and the following values:
> >>
> >> name : John Smith
> >> food  : subway sandwich
> >>
> >> So the general-purpose "content" would have the following values:
> >>
> >> John Smith subway sandwich
> >>
> >> Hence, if the user search for "smith subway" (with quotation), the
> >> said
> >> document will be returned. On the other hand, if both fields were
> >> indexed
> >> seperately, this document would not be returned, since there is no
> >> field
> >> that contain the value "smith subway".
> >>
> >> How do we go about this problem ?
> >>
> >>
> >> On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> >> >
> >> >
> >> > On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> >> > > In "Lucene In Action" book it says it is better practice to
> >> combine
> >> > > two fields into one field and index it than use the
> >> > > MultiFieldQueryParser. Do I initially index both the fields and
> >> > > then index them again together? When I index them together do I
> >> > > index the fieldnames or values? Can someone give me an example of
> >> > > how to do it?
> >> >
> >> > What I do is simply index all the fields individually that need
> >> to be
> >> > searchable or just stored, but also index a general-purpose
> >> > "contents" field with all of that same text.
> >> >
> >> > You can add multiple fields of the same name to a document,
> >> making it
> >> > easy to just keep appending to a "contents" field for a document.
> >> > You can see how this is done in the Lucene in Action code in the
> >> > TestDataDocumentHandler.java - however I took a cruder approach and
> >> > appended the fields together with a space in between them rather
> >> than
> >> > using the multiple valued field approach.  Either technique will
> >> work
> >> > just fine.
> >> >
> >> >        Erik
> >> >
> >> >
> >> >
> >>
> >>
>
>

Re: How to combine multiple fields to a single field for indexing

Posted by Suba Suresh <su...@wolfram.com>.

Thanks for everyone's help. I understand how it works now. I can get rid 
of MultiFieldQueryParser in search.

thanks
suba suresh.


Erik Hatcher wrote:
> Yeah, I used a cruder form by appending all the text together into a 
> single string with a space separator in that LIA example.
> 
> Given the position increment gap between instances of same-named fields 
> that is now part of Lucene, I recommend using multiple field instances 
> instead.
> 
>     Erik
> 
> 
> 
> On Aug 24, 2006, at 3:05 AM, Gopikrishnan Subramani wrote:
>> Erik's has used a space as the field separator. May be you can use a
>> different field separator that your analyzer won't eat up, so that will
>> change the token position by 1.
>>
>> Gopi
>>
>> On 8/24/06, KEGan <kh...@gmail.com> wrote:
>>>
>>> Erik,
>>>
>>> What is generally the reason for indexing both individual fields, and 
>>> the
>>> general-purpose "content" field ?
>>>
>>> Also, if we search in the general-purpose "content" field, wouldnt this
>>> problem occurs. Let say we have 2 fields and the following values:
>>>
>>> name : John Smith
>>> food  : subway sandwich
>>>
>>> So the general-purpose "content" would have the following values:
>>>
>>> John Smith subway sandwich
>>>
>>> Hence, if the user search for "smith subway" (with quotation), the said
>>> document will be returned. On the other hand, if both fields were 
>>> indexed
>>> seperately, this document would not be returned, since there is no field
>>> that contain the value "smith subway".
>>>
>>> How do we go about this problem ?
>>>
>>>
>>> On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>> >
>>> >
>>> > On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
>>> > > In "Lucene In Action" book it says it is better practice to combine
>>> > > two fields into one field and index it than use the
>>> > > MultiFieldQueryParser. Do I initially index both the fields and
>>> > > then index them again together? When I index them together do I
>>> > > index the fieldnames or values? Can someone give me an example of
>>> > > how to do it?
>>> >
>>> > What I do is simply index all the fields individually that need to be
>>> > searchable or just stored, but also index a general-purpose
>>> > "contents" field with all of that same text.
>>> >
>>> > You can add multiple fields of the same name to a document, making it
>>> > easy to just keep appending to a "contents" field for a document.
>>> > You can see how this is done in the Lucene in Action code in the
>>> > TestDataDocumentHandler.java - however I took a cruder approach and
>>> > appended the fields together with a space in between them rather than
>>> > using the multiple valued field approach.  Either technique will work
>>> > just fine.
>>> >
>>> >        Erik
>>> >
>>> >
>>>
>>>
> 
> 



Re: How to combine multiple fields to a single field for indexing

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Yeah, I used a cruder form by appending all the text together into a  
single string with a space separator in that LIA example.

Given the position increment gap between instances of same-named  
fields that is now part of Lucene, I recommend using multiple field  
instances instead.

	Erik



On Aug 24, 2006, at 3:05 AM, Gopikrishnan Subramani wrote:
> Erik's has used a space as the field separator. May be you can use a
> different field separator that your analyzer won't eat up, so that  
> will
> change the token position by 1.
>
> Gopi
>
> On 8/24/06, KEGan <kh...@gmail.com> wrote:
>>
>> Erik,
>>
>> What is generally the reason for indexing both individual fields,  
>> and the
>> general-purpose "content" field ?
>>
>> Also, if we search in the general-purpose "content" field, wouldnt  
>> this
>> problem occurs. Let say we have 2 fields and the following values:
>>
>> name : John Smith
>> food  : subway sandwich
>>
>> So the general-purpose "content" would have the following values:
>>
>> John Smith subway sandwich
>>
>> Hence, if the user search for "smith subway" (with quotation), the  
>> said
>> document will be returned. On the other hand, if both fields were  
>> indexed
>> seperately, this document would not be returned, since there is no  
>> field
>> that contain the value "smith subway".
>>
>> How do we go about this problem ?
>>
>>
>> On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> >
>> >
>> > On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
>> > > In "Lucene In Action" book it says it is better practice to  
>> combine
>> > > two fields into one field and index it than use the
>> > > MultiFieldQueryParser. Do I initially index both the fields and
>> > > then index them again together? When I index them together do I
>> > > index the fieldnames or values? Can someone give me an example of
>> > > how to do it?
>> >
>> > What I do is simply index all the fields individually that need  
>> to be
>> > searchable or just stored, but also index a general-purpose
>> > "contents" field with all of that same text.
>> >
>> > You can add multiple fields of the same name to a document,  
>> making it
>> > easy to just keep appending to a "contents" field for a document.
>> > You can see how this is done in the Lucene in Action code in the
>> > TestDataDocumentHandler.java - however I took a cruder approach and
>> > appended the fields together with a space in between them rather  
>> than
>> > using the multiple valued field approach.  Either technique will  
>> work
>> > just fine.
>> >
>> >        Erik
>> >
>> >
>> >  
>>
>>



Re: How to combine multiple fields to a single field for indexing

Posted by KEGan <kh...@gmail.com>.

I think I start to understand this :) .. Thanks guys.

~KEGan


On 8/24/06, Gopikrishnan Subramani <go...@gmail.com> wrote:
>
> Erik's has used a space as the field separator. May be you can use a
> different field separator that your analyzer won't eat up, so that will
> change the token position by 1.
>
> Gopi
>
> On 8/24/06, KEGan <kh...@gmail.com> wrote:
> >
> > Erik,
> >
> > What is generally the reason for indexing both individual fields, and
> the
> > general-purpose "content" field ?
> >
> > Also, if we search in the general-purpose "content" field, wouldnt this
> > problem occurs. Let say we have 2 fields and the following values:
> >
> > name : John Smith
> > food  : subway sandwich
> >
> > So the general-purpose "content" would have the following values:
> >
> > John Smith subway sandwich
> >
> > Hence, if the user search for "smith subway" (with quotation), the said
> > document will be returned. On the other hand, if both fields were
> indexed
> > seperately, this document would not be returned, since there is no field
> > that contain the value "smith subway".
> >
> > How do we go about this problem ?
> >
> >
> > On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> > >
> > >
> > > On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> > > > In "Lucene In Action" book it says it is better practice to combine
> > > > two fields into one field and index it than use the
> > > > MultiFieldQueryParser. Do I initially index both the fields and
> > > > then index them again together? When I index them together do I
> > > > index the fieldnames or values? Can someone give me an example of
> > > > how to do it?
> > >
> > > What I do is simply index all the fields individually that need to be
> > > searchable or just stored, but also index a general-purpose
> > > "contents" field with all of that same text.
> > >
> > > You can add multiple fields of the same name to a document, making it
> > > easy to just keep appending to a "contents" field for a document.
> > > You can see how this is done in the Lucene in Action code in the
> > > TestDataDocumentHandler.java - however I took a cruder approach and
> > > appended the fields together with a space in between them rather than
> > > using the multiple valued field approach.  Either technique will work
> > > just fine.
> > >
> > >        Erik
> > >
> > >
> >
> >
>
>

Re: How to combine multiple fields to a single field for indexing

Posted by Gopikrishnan Subramani <go...@gmail.com>.

Erik has used a space as the field separator. Maybe you can use a
different field separator that your analyzer won't eat up, so that it will
change the token position by 1.

Gopi

On 8/24/06, KEGan <kh...@gmail.com> wrote:
>
> Erik,
>
> What is generally the reason for indexing both individual fields, and the
> general-purpose "content" field ?
>
> Also, if we search in the general-purpose "content" field, wouldnt this
> problem occurs. Let say we have 2 fields and the following values:
>
> name : John Smith
> food  : subway sandwich
>
> So the general-purpose "content" would have the following values:
>
> John Smith subway sandwich
>
> Hence, if the user search for "smith subway" (with quotation), the said
> document will be returned. On the other hand, if both fields were indexed
> seperately, this document would not be returned, since there is no field
> that contain the value "smith subway".
>
> How do we go about this problem ?
>
>
> On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> >
> >
> > On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> > > In "Lucene In Action" book it says it is better practice to combine
> > > two fields into one field and index it than use the
> > > MultiFieldQueryParser. Do I initially index both the fields and
> > > then index them again together? When I index them together do I
> > > index the fieldnames or values? Can someone give me an example of
> > > how to do it?
> >
> > What I do is simply index all the fields individually that need to be
> > searchable or just stored, but also index a general-purpose
> > "contents" field with all of that same text.
> >
> > You can add multiple fields of the same name to a document, making it
> > easy to just keep appending to a "contents" field for a document.
> > You can see how this is done in the Lucene in Action code in the
> > TestDataDocumentHandler.java - however I took a cruder approach and
> > appended the fields together with a space in between them rather than
> > using the multiple valued field approach.  Either technique will work
> > just fine.
> >
> >        Erik
> >
> >
>
>

Re: How to combine multiple fields to a single field for indexing

Posted by Chris Hostetter <ho...@fucit.org>.

: What is generally the reason for indexing both individual fields, and the
: general-purpose "content" field ?

so you can explicitly query for "name:paris" or "city:paris" instead of
just "paris"

: name : John Smith
: food  : subway sandwich
:
: So the general-purpose "content" would have the following values:
:
: John Smith subway sandwich
:
: Hence, if the user search for "smith subway" (with quotation), the said

not exactly ... this is where the position increment gap of your Analyzer
comes in.  you can say how much gap exists between two separate values in
the same field, so if your gap is 10 then contents:"smith subway"~5 won't
match ... but contents:(smith subway) will
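
Here is that example worked through as a small plain-Java model. Nothing in it
is Lucene API: ContentsFieldModel, index(), phraseMatches() and
allTermsPresent() are hypothetical names, tokenizing is a lowercase whitespace
split, and the phrase check only handles two terms in order (a real sloppy
phrase query is more general). It assumes the next value's first token is
placed at last position + gap + 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of one multi-valued field; not a Lucene class.
class ContentsFieldModel {
    // Builds term -> positions, adding the gap after each value.
    static Map<String, List<Integer>> index(int gap, String... values) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = -1;
        for (String value : values) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                position++;
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position += gap;
        }
        return postings;
    }

    // Simplified two-term, in-order phrase check: the number of positions
    // skipped between the terms must not exceed the slop.
    static boolean phraseMatches(Map<String, List<Integer>> postings,
                                 String first, String second, int slop) {
        for (int p1 : postings.getOrDefault(first, Collections.emptyList())) {
            for (int p2 : postings.getOrDefault(second, Collections.emptyList())) {
                if (p2 > p1 && p2 - p1 - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Models contents:(smith subway) with every term required:
    // positions are ignored, only presence counts.
    static boolean allTermsPresent(Map<String, List<Integer>> postings, String... terms) {
        for (String term : terms) {
            if (!postings.containsKey(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index(10, "John Smith", "subway sandwich");
        System.out.println(phraseMatches(idx, "smith", "subway", 5)); // false
        System.out.println(allTermsPresent(idx, "smith", "subway"));  // true
    }
}
```

With a gap of 0 the same phrase check succeeds even at slop 0, which is
exactly the false hit described in the question.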


-Hoss



Re: How to combine multiple fields to a single field for indexing

Posted by KEGan <kh...@gmail.com>.

Erik,

What is generally the reason for indexing both individual fields, and the
general-purpose "content" field ?

Also, if we search in the general-purpose "content" field, wouldn't this
problem occur? Let's say we have 2 fields and the following values:

name : John Smith
food  : subway sandwich

So the general-purpose "content" would have the following values:

John Smith subway sandwich

Hence, if the user searches for "smith subway" (with quotation), the said
document will be returned. On the other hand, if both fields were indexed
separately, this document would not be returned, since there is no field
that contains the value "smith subway".

How do we go about this problem ?


On 8/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> > In "Lucene In Action" book it says it is better practice to combine
> > two fields into one field and index it than use the
> > MultiFieldQueryParser. Do I initially index both the fields and
> > then index them again together? When I index them together do I
> > index the fieldnames or values? Can someone give me an example of
> > how to do it?
>
> What I do is simply index all the fields individually that need to be
> searchable or just stored, but also index a general-purpose
> "contents" field with all of that same text.
>
> You can add multiple fields of the same name to a document, making it
> easy to just keep appending to a "contents" field for a document.
> You can see how this is done in the Lucene in Action code in the
> TestDataDocumentHandler.java - however I took a cruder approach and
> appended the fields together with a space in between them rather than
> using the multiple valued field approach.  Either technique will work
> just fine.
>
>        Erik
>
>

Re: How to combine multiple fields to a single field for indexing

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> In "Lucene In Action" book it says it is better practice to combine  
> two fields into one field and index it than use the  
> MultiFieldQueryParser. Do I initially index both the fields and  
> then index them again together? When I index them together do I  
> index the fieldnames or values? Can someone give me an example of  
> how to do it?

What I do is simply index all the fields individually that need to be  
searchable or just stored, but also index a general-purpose  
"contents" field with all of that same text.

You can add multiple fields of the same name to a document, making it  
easy to just keep appending to a "contents" field for a document.   
You can see how this is done in the Lucene in Action code in the  
TestDataDocumentHandler.java - however I took a cruder approach and  
appended the fields together with a space in between them rather than  
using the multiple valued field approach.  Either technique will work  
just fine.
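
For anyone who wants to see the two techniques side by side, here is a rough
sketch in plain Java. DocModel and CatchAllIndexing are made-up stand-ins for
a document and an indexing routine, not Lucene classes, and the "name" and
"food" fields are sample data borrowed from later in this thread. The point it
illustrates: you index the field values (not the field names), once under
their own field name and once more under "contents".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a document: an ordered list of (name, value) field instances.
// The same name may be added more than once.
class DocModel {
    private final List<String[]> fields = new ArrayList<>();

    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    // All values added under one field name, in insertion order.
    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                out.add(field[1]);
            }
        }
        return out;
    }
}

class CatchAllIndexing {
    // Technique 1: one "contents" instance per source value.
    static DocModel multiValued(Map<String, String> source) {
        DocModel doc = new DocModel();
        for (Map.Entry<String, String> e : source.entrySet()) {
            doc.add(e.getKey(), e.getValue());  // individually searchable
            doc.add("contents", e.getValue());  // catch-all, same text again
        }
        return doc;
    }

    // Technique 2 (the cruder one): join everything with a space.
    static DocModel concatenated(Map<String, String> source) {
        DocModel doc = new DocModel();
        StringBuilder all = new StringBuilder();
        for (Map.Entry<String, String> e : source.entrySet()) {
            doc.add(e.getKey(), e.getValue());
            if (all.length() > 0) {
                all.append(' ');
            }
            all.append(e.getValue());
        }
        doc.add("contents", all.toString());
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("name", "John Smith");
        source.put("food", "subway sandwich");
        System.out.println(multiValued(source).values("contents"));
        System.out.println(concatenated(source).values("contents"));
    }
}
```

Either way a search on "contents" sees all the text, so a single-field query
parser pointed at "contents" can stand in for MultiFieldQueryParser.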

	Erik
