Posted to solr-user@lucene.apache.org by lee carroll <le...@googlemail.com> on 2011/10/07 09:27:59 UTC

multiple document types in a core

I've prototyped a solution which makes use of multiple doc types.
Does the following have any value in terms of field value storage, or
are field values saved once, with pointers maintained from other records,
making the design below redundant?

We have CITY (500) and each city has many HOTEL (75000). Our schema looks like:

doc type: CITY (stored and indexed)
n city attributes (lat, lon, name, city description, city
holiday categories, etc.), all stored and indexed

doc type: HOTEL (stored and indexed)
n hotel attributes (lat, lon, hotel name, hotel description,
star rating, feature list, etc.), all stored and indexed
n city attributes (lat, lon, name, city description, city
holiday categories), indexed but **not stored**

Basically we can search hotels using city attributes but to display
city data for a chosen hotel we would search for that city document to
retrieve values.
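
A minimal sketch of that two-step lookup, written as the two requests a client might build. The base URL and every field name (`doctype`, `city_id`, `city_category`, `hotel_name`, and so on) are assumptions made for illustration, not our real schema; in particular it assumes each hotel doc carries a stored `city_id` that matches the city doc's `id`. Only the standard Solr request parameters `q`, `fq`, `fl` and `wt` are relied on.

```python
from urllib.parse import urlencode

# Hypothetical base URL and field names; adjust to the real core and schema.
SOLR_SELECT = "http://localhost:8983/solr/select"


def hotel_search_url(city_category: str) -> str:
    """Step 1: find HOTEL docs via a city attribute that is indexed but not stored."""
    params = {
        "q": f"city_category:{city_category}",
        "fq": "doctype:HOTEL",
        # Only stored hotel fields can be returned; city_id links to the CITY doc.
        "fl": "id,hotel_name,star_rating,city_id",
        "wt": "json",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


def city_lookup_url(city_id: str) -> str:
    """Step 2: fetch the stored city values from the single CITY doc."""
    params = {
        "q": f"id:{city_id}",
        "fq": "doctype:CITY",
        "fl": "id,name,city_description,lat,lon",
        "wt": "json",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


if __name__ == "__main__":
    print(hotel_search_url("beach"))
    print(city_lookup_url("city-42"))
```

The cost of the design is visible here: displaying one hotel with its city data takes a second request, in exchange for not storing the city values on every hotel doc.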

Do we gain anything here? Would the city fields associated with hotels
be stored and repeated 74500 fewer times, or are the values stored once,
with pointers kept for each hotel document to point at the city values?

Has anyone done a similar thing and run into problems?

thanks in advance lee c

Re: multiple document types in a core

Posted by lee carroll <le...@googlemail.com>.
Hi Erick,

You're right, I think. On resources we gain a little on:
disk (a production implementation with live data would save 500 MB
of disk usage on each slave and on the master)
network traffic on replication, which is reduced somewhat (we do a full
re-index every 24 hours at present)

On design we gain a little by being able to support searches at
various document levels (perform a destination search or a hotel search
and return documents at the "correct" level for the search without the
need to perform field collapsing).

But in the cold light of day I don't think we gain huge amounts
(leaving aside the replication of a full index).

cheers lee c



On 23 October 2011 19:05, Erick Erickson <er...@gmail.com> wrote:
> Yes, stored fields are placed verbatim for every doc. But I wonder
> at the utility of trying to share stored information. The stored
> info is put in certain files in the index, see:
> http://lucene.apache.org/java/3_0_2/fileformats.html#file-names
>
> and the files that store data are pretty much irrelevant to searching,
> the data in them is only referenced when assembling the document
> for return. So by adding this complexity you'll be saving a bit
> on file transfers when replicating your index, but not much else.
>
> Is it worth it? If so, why?
>
> Best
> Erick

Re: multiple document types in a core

Posted by Erick Erickson <er...@gmail.com>.
Yes, stored fields are placed verbatim for every doc. But I wonder
at the utility of trying to share stored information. The stored
info is put in certain files in the index, see:
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names

and the files that store data are pretty much irrelevant to searching,
the data in them is only referenced when assembling the document
for return. So by adding this complexity you'll be saving a bit
on file transfers when replicating your index, but not much else.

Is it worth it? If so, why?

Best
Erick

On Mon, Oct 17, 2011 at 11:07 AM, lee carroll
<le...@googlemail.com> wrote:
> Just as a follow up
>
> it looks like stored fields are stored verbatim for every doc.
>
> hotel docs with dest attributes indexed and stored
> index size: 131M
> number of records: 49147
>
> hotel docs with dest attributes indexed only
> index size: 111M
> number of records: 49147
>
>
> ~400 chars (bytes) of destination data * 49147 (number of hotel docs) = ~19M
>
> basically everything is being stored
>
> No difference in time to index (very rough and not scientific :-) )
>
> So it does seem an OK strategy to denormalise docs with indexed fields
> but normalise with stored fields?
> Or have I missed some problems with this?
>
> cheers lee c

Re: multiple document types in a core

Posted by lee carroll <le...@googlemail.com>.
Just as a follow up

it looks like stored fields are stored verbatim for every doc.

hotel docs with dest attributes indexed and stored
index size: 131M
number of records: 49147

hotel docs with dest attributes indexed only
index size: 111M
number of records: 49147


~400 chars (bytes) of destination data * 49147 (number of hotel docs) = ~19M

basically everything is being stored
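
The arithmetic above can be checked with a few lines. This is only a sanity check on the figures quoted in this post; the choice of decimal megabytes is an assumption, and the 400 bytes per doc is itself a rough estimate.

```python
# Figures quoted in the post above.
HOTEL_DOCS = 49147
DEST_BYTES_PER_DOC = 400      # approximate size of the destination data per hotel doc
SIZE_STORED_MB = 131          # dest attributes indexed and stored
SIZE_INDEXED_ONLY_MB = 111    # dest attributes indexed only


def expected_saving_mb(docs: int, bytes_per_doc: int) -> float:
    """Stored bytes avoided if every doc drops bytes_per_doc of stored data."""
    return docs * bytes_per_doc / 1_000_000  # decimal megabytes (assumption)


if __name__ == "__main__":
    predicted = expected_saving_mb(HOTEL_DOCS, DEST_BYTES_PER_DOC)
    observed = SIZE_STORED_MB - SIZE_INDEXED_ONLY_MB
    print(f"predicted: {predicted:.1f} MB, observed: {observed} MB")
```

The predicted saving (about 19.7 MB) is close to the observed 20 MB difference, which is what you would expect if the stored values are written out once per document rather than shared.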

No difference in time to index (very rough and not scientific :-) )

So it does seem an OK strategy to denormalise docs with indexed fields
but normalise with stored fields?
Or have I missed some problems with this?

cheers lee c



On 16 October 2011 11:54, lee carroll <le...@googlemail.com> wrote:
> Hi Chris thanks for the response
>
>> It's an inverted index, so *terms* exist once (per segment) and those terms
>> "point" to the documents -- so having the same terms (in the same fields)
>> for multiple types of documents in one index is going to take up less
>> overall space than having distinct collections for each type of document.
>
> I'm not asking about the indexed terms but rather the stored values.
> By having two doc types, are we gaining anything by "storing"
> attributes only for that doc type?
>
> cheers lee c
>

Re: multiple document types in a core

Posted by lee carroll <le...@googlemail.com>.
Hi Chris thanks for the response

> It's an inverted index, so *terms* exist once (per segment) and those terms
> "point" to the documents -- so having the same terms (in the same fields)
> for multiple types of documents in one index is going to take up less
> overall space than having distinct collections for each type of document.

I'm not asking about the indexed terms but rather the stored values.
By having two doc types, are we gaining anything by "storing"
attributes only for that doc type?

cheers lee c

Re: multiple document types in a core

Posted by Chris Hostetter <ho...@fucit.org>.
: Basically we can search hotels using city attributes but to display
: city data for a chosen hotel we would search for that city document to
: retrieve values.
: 
: Do we gain anything here? Would the city fields associated with hotels
: be stored and repeated 74500 fewer times, or are the values stored once,
: with pointers kept for each hotel document to point at the city values?

It's an inverted index, so *terms* exist once (per segment) and those terms
"point" to the documents -- so having the same terms (in the same fields)
for multiple types of documents in one index is going to take up less
overall space than having distinct collections for each type of document.

If you use *completely* different fields for each type of document (or use
the same fields, but the documents have completely different terms in
those fields) then you're better off with different collections.

-Hoss
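
The distinction this thread turns on, indexed terms kept once with pointers to documents versus stored values copied per document, can be illustrated with a toy model. This is a sketch only and not Lucene's actual on-disk format; the function name, the data and the whitespace tokenisation are all assumptions made for illustration.

```python
from collections import defaultdict


def build_toy_index(docs):
    """Return (postings, stored) for a list of {field: text} docs.

    postings: (field, term) -> list of doc ids; each distinct term is kept once.
    stored:   doc id -> verbatim copy of that doc's field values.
    """
    postings = defaultdict(list)
    stored = {}
    for doc_id, doc in enumerate(docs):
        stored[doc_id] = dict(doc)  # stored values are copied for every document
        for field, text in doc.items():
            for term in set(text.lower().split()):
                postings[(field, term)].append(doc_id)
    return postings, stored


if __name__ == "__main__":
    # Hypothetical data: three hotel docs sharing the same city description.
    docs = [{"city_description": "sunny coastal city"} for _ in range(3)]
    postings, stored = build_toy_index(docs)
    print(len(postings))  # number of distinct (field, term) entries
    print(postings[("city_description", "sunny")])  # doc ids the term points to
    print(sum(len(d["city_description"]) for d in stored.values()))  # stored chars
```

In this model, adding more hotels with the same description only lengthens the posting lists, while the stored side grows by the full text each time, which matches the index sizes reported in the follow-up messages.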