Posted to solr-user@lucene.apache.org by Vadim Kisselmann <v....@googlemail.com> on 2012/03/28 16:55:26 UTC

Localize the largest fields (content) in index

Hello folks,

I work with Solr 4.0 r1292064 from trunk.
My index grows fast: with 10 million docs I get an index size of 150GB
(25% stored, 75% indexed).
I want to find out which fields (content) are too large, so I can
consider countermeasures.

How can I locate/discover the largest fields in my index?
Luke (latest from trunk) doesn't work with my Solr version. I built the
Lucene/Solr .jars and tried to feed them to Luke, but I get many errors
and can't build it.

What other options do i have?

Thanks and best regards
Vadim
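
Since Luke won't build against this trunk revision, one workaround is a
small standalone program compiled against the same Lucene .jars. Below is
a minimal, hedged sketch using the Lucene 4.x API to print per-field term
statistics as a rough proxy for field size; the exact class and method
names on trunk may differ:

  import java.io.File;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.Fields;
  import org.apache.lucene.index.MultiFields;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.store.FSDirectory;

  // Prints unique-term and total-occurrence counts per indexed field.
  public class FieldStats {
      public static void main(String[] args) throws Exception {
          DirectoryReader reader =
              DirectoryReader.open(FSDirectory.open(new File(args[0])));
          Fields fields = MultiFields.getFields(reader);
          for (String field : fields) {
              Terms terms = fields.terms(field);
              if (terms == null) continue;
              // size() = unique terms; sumTotalTermFreq = total token
              // occurrences (either may be -1 if the codec omits it).
              System.out.printf("%s: %d unique terms, %d occurrences%n",
                      field, terms.size(), terms.getSumTotalTermFreq());
          }
          reader.close();
      }
  }

The fields with the biggest occurrence counts are usually the ones
dominating the postings and term vector files.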

Re: Localize the largest fields (content) in index

Posted by Erick Erickson <er...@gmail.com>.
I don't think there's really any reason SolrCloud won't work with
Tomcat; the setup is probably just tricky. See:
http://lucene.472066.n3.nabble.com/SolrCloud-new-td1528872.html
It's about a year old, but might prove helpful.

Best
Erick

On Thu, Mar 29, 2012 at 3:41 PM, Vadim Kisselmann
<v....@googlemail.com> wrote:
> [...]

Re: Localize the largest fields (content) in index

Posted by Vadim Kisselmann <v....@googlemail.com>.
Yes, I think so, too :)
MLT doesn't really need termVectors, but it's faster with them. I found
out that MLT works better on the title field in my case than on the big
text fields.

Sharding is in planning, but my setup with SolrCloud, ZK and Tomcat
doesn't work, see here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201203.mbox/%3CCA+GXEZE3LCTtgXFzn9uEdRxMymGF=z0UJB9s8b0qkipAfn6fsA@mail.gmail.com%3E
I split my huge index (the 150GB index in this case is my test index) and
want to use SolrCloud, but it's not runnable with Tomcat at this time.

Best regards
Vadim
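
For reference, MLT against the title field can be exercised through the
standard MoreLikeThis parameters on the select handler; the host, port
and document id below are placeholders, not taken from this thread:

  http://localhost:8983/solr/select?q=id:123&mlt=true&mlt.fl=title&mlt.mintf=1&mlt.mindf=2&mlt.count=10

Without termVectors on the mlt.fl field, MLT falls back to re-analyzing
the stored field values, which is why it still works but runs slower.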


2012/3/29 Erick Erickson <er...@gmail.com>:
> [...]

Re: Localize the largest fields (content) in index

Posted by Erick Erickson <er...@gmail.com>.
Yeah, it's worth a try. The term vectors aren't entirely necessary for
highlighting, although they do make things more efficient.

As far as MLT, does MLT really need such a big field?

But you may be on your way to sharding your index if you remove this info
and testing shows problems...

Best
Erick

On Thu, Mar 29, 2012 at 9:32 AM, Vadim Kisselmann
<v....@googlemail.com> wrote:
> [...]

Re: Localize the largest fields (content) in index

Posted by Vadim Kisselmann <v....@googlemail.com>.
Hi Erick,
thanks :)
The admin UI gives me the counts, so I can identify the fields with big
numbers of unique terms.
I know this wiki page, but I read it one more time.
List of my file extensions with sizes (index size ~150GB):
tvf  90GB
fdt  30GB
tim  18GB
prx  15GB
frq  12GB
tip  200MB
tvx  150MB

tvf is my biggest file extension.
Wiki: "This file contains, for each field that has a term vector
stored, a list of the terms, their frequencies and, optionally,
position and offset information."

Hmm, I use termVectors on my biggest fields because of MLT and highlighting.
But I think I should test my performance without termVectors. Good idea? :)

What do you think about my file extension sizes?

Best regards
Vadim
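
Turning term vectors off is a schema change plus a full reindex. A hedged
sketch of what the field definition might look like in schema.xml; the
field name and type here are placeholders, not taken from this thread:

  <!-- Hypothetical field: termVectors (and the dependent termPositions /
       termOffsets) disabled to shrink the .tvf/.tvx files. A full reindex
       is required for the change to take effect on existing data. -->
  <field name="content" type="text_general"
         indexed="true" stored="true"
         termVectors="false" termPositions="false" termOffsets="false"/>

Since termVectors defaults to "false" in Solr, simply removing the three
termVector attributes has the same effect.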




2012/3/29 Erick Erickson <er...@gmail.com>:
> [...]

Re: Localize the largest fields (content) in index

Posted by Erick Erickson <er...@gmail.com>.
The admin UI (schema browser) will give you the counts of unique terms
in your fields, which is where I'd start.

I suspect you've already seen this page, but if not:
http://lucene.apache.org/java/3_5_0/fileformats.html#file-names
The .fdt and .fdx file extensions are where data goes when you set
stored="true". These files don't affect search speed; they just contain
the verbatim copy of the data.

The relative sizes of the various files above should give you a hint as
to what's using the most space, but it'll be a bit of a hunt for you to
pinpoint what's actually going on. TermVectors and norms are often big
consumers of space.

Best
Erick

On Wed, Mar 28, 2012 at 10:55 AM, Vadim Kisselmann
<v....@googlemail.com> wrote:
> [...]
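
One way to get those relative sizes without any tooling is to sum the
index files grouped by extension. A minimal sketch in plain Java (the
index directory path is passed as an argument; nothing here is specific
to this thread's setup):

  import java.io.File;
  import java.util.Map;
  import java.util.TreeMap;

  // Sums Lucene index file sizes by extension (tvf, fdt, tim, ...) so
  // the biggest index structures stand out at a glance.
  public class IndexSizeByExtension {
      public static void main(String[] args) {
          Map<String, Long> sizes = new TreeMap<String, Long>();
          for (File f : new File(args[0]).listFiles()) {
              String name = f.getName();
              int dot = name.lastIndexOf('.');
              String ext = (dot < 0) ? name : name.substring(dot + 1);
              Long prev = sizes.get(ext);
              sizes.put(ext, (prev == null ? 0L : prev) + f.length());
          }
          for (Map.Entry<String, Long> e : sizes.entrySet()) {
              System.out.printf("%-4s %,d bytes%n", e.getKey(), e.getValue());
          }
      }
  }

Mapping each extension back to the file-formats page above then shows
which structure (term vectors, stored fields, postings, ...) dominates.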