Posted to solr-user@lucene.apache.org by Peter Spam <ps...@mac.com> on 2010/07/21 02:36:47 UTC

Solr searching performance issues, using large documents

Data set: About 4,000 log files (will eventually grow to millions).  The average log file is 850 KB; the largest (so far) is about 70 MB.

Problem: When I search for common terms, the query time jumps from 2-3 seconds to about 60 seconds.  TermVectors etc. are enabled.  When I disable highlighting, performance improves a lot but is still slow for some queries (7 seconds).  Thanks in advance for any ideas!


-Peter


-------------------------------------------------------------------------------------------------------------------------------------

4GB RAM server
% java -Xms2048M -Xmx3072M -jar start.jar

-------------------------------------------------------------------------------------------------------------------------------------

schema.xml changes:

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>

...

   <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
   <field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
   <field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
   <field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
   <field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
   <field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
   <field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
   <field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>

...

 <dynamicField name="*" type="ignored" multiValued="true" />
 <defaultSearchField>body</defaultSearchField>
 <solrQueryParser defaultOperator="AND"/>

-------------------------------------------------------------------------------------------------------------------------------------

solrconfig.xml changes:

    <maxFieldLength>2147483647</maxFieldLength>
    <ramBufferSizeMB>128</ramBufferSizeMB>

-------------------------------------------------------------------------------------------------------------------------------------

The query:

rowStr = "&rows=10"
facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
termvectors = "&tv=true&qt=tvrh&tv.all=true"
hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
regexv = "(?m)^.*\n.*\n.*$"
hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)

thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl + hl_regex

baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
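
For reference, with p['q'] = "hello", an empty fq, and the fuzzy/minLogSize variables left blank, the code above assembles a request along these lines (the URL-escaped three-line regex is elided here):

/solr/select?timeAllowed=5000&wt=ruby&q=body%3Ahello&rows=10&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version&fl=id,score,filename,version,device,first2md5,filesize,ckey&tv=true&qt=tvrh&tv.all=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&hl.regex.pattern=...&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647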


Re: Solr searching performance issues, using large documents (now 1MB documents)

Posted by Lance Norskog <go...@gmail.com>.
How much disk space is used by the index?

If you run the Lucene CheckIndex program, how many terms etc. does it report?

When you do the first facet query, how much does the memory in use grow?

Are you storing the text fields, or only indexing? Do you fetch the
facets only, or do you also fetch the document contents?
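
For the CheckIndex question: the tool lives in the Lucene core jar, so the invocation is roughly this (a sketch; the jar version and index path are placeholders to adapt to your install):

% java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index

It prints per-segment document, term, and position counts, plus any corruption it finds.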

On Wed, Aug 25, 2010 at 11:34 AM, Peter Spam <ps...@mac.com> wrote:
> This is a very small number of documents (7000), so I am surprised Solr is having such a hard time with it!!
>
> I do facet on 3 terms.
>
> Subsequent "hello" searches are faster, but still well over a second.  This is a very fast Mac Pro, with 6GB of RAM.
>
>
> Thanks,
> Peter
>
> On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:
>
>> On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam <ps...@mac.com> wrote:
>>> So, I went through all the effort to break my documents into max 1 MB chunks, and searching for hello still takes over 40 seconds (searching across 7433 documents):
>>>
>>>        8 results (41980 ms)
>>>
>>> What is going on???  (scroll down for my config).
>>
>> Are you still faceting on that query also?
>> Breaking your docs into many chunks means inflating the doc count and
>> will make faceting slower.
>> Also, first-time faceting (as with sorting) is slow... did you try
>> another query after  "hello" (and without a commit happening
>> inbetween) to see if it was faster?
>>
>> -Yonik
>> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents (now 1MB documents)

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Aug 25, 2010 at 2:34 PM, Peter Spam <ps...@mac.com> wrote:
> This is a very small number of documents (7000), so I am surprised Solr is having such a hard time with it!!
>
> I do facet on 3 terms.
>
> Subsequent "hello" searches are faster, but still well over a second.  This is a very fast Mac Pro, with 6GB of RAM.

Search apps often need tweaking for best performance.
We probably need to determine if you are IO bound (because the index
is large enough that there are many disk seeks) or if you are CPU
bound (possible, depending on the faceting).
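
One quick way to tell, with nothing Solr-specific, is to watch the machine while a slow query runs (standard Unix tooling; flags vary by platform):

% iostat -w 1   # OS X (on Linux: iostat 1); heavy disk reads during the query suggest IO bound
% top           # sustained high CPU during the same query suggests CPU bound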

Perhaps one easy thing to start with is to add debugQuery=true and
report the timings of the different components it gives.
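
Concretely, something like this (a sketch; substitute the real query and host):

http://localhost:8983/solr/select?q=body%3Ahello&facet=true&facet.field=device&facet.field=ckey&facet.field=version&hl=true&hl.fl=body&debugQuery=true

The debug output includes a per-component timing breakdown (query, facet, highlight), which should show where the seconds go.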

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

Re: Solr searching performance issues, using large documents (now 1MB documents)

Posted by Peter Spam <ps...@mac.com>.
This is a very small number of documents (7000), so I am surprised Solr is having such a hard time with it!!

I do facet on 3 terms.

Subsequent "hello" searches are faster, but still well over a second.  This is a very fast Mac Pro, with 6GB of RAM.


Thanks,
Peter

On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:

> On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam <ps...@mac.com> wrote:
>> So, I went through all the effort to break my documents into max 1 MB chunks, and searching for hello still takes over 40 seconds (searching across 7433 documents):
>> 
>>        8 results (41980 ms)
>> 
>> What is going on???  (scroll down for my config).
> 
> Are you still faceting on that query also?
> Breaking your docs into many chunks means inflating the doc count and
> will make faceting slower.
> Also, first-time faceting (as with sorting) is slow... did you try
> another query after  "hello" (and without a commit happening
> inbetween) to see if it was faster?
> 
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Solr searching performance issues, using large documents (now 1MB documents)

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam <ps...@mac.com> wrote:
> So, I went through all the effort to break my documents into max 1 MB chunks, and searching for hello still takes over 40 seconds (searching across 7433 documents):
>
>        8 results (41980 ms)
>
> What is going on???  (scroll down for my config).

Are you still faceting on that query also?
Breaking your docs into many chunks means inflating the doc count and
will make faceting slower.
Also, first-time faceting (as with sorting) is slow... did you try
another query after  "hello" (and without a commit happening
inbetween) to see if it was faster?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
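
The usual way to hide that first-time faceting cost is a static warming query in solrconfig.xml; a sketch, with the facet fields taken from the schema at the top of this page (treat the block as a template):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">device</str>
        <str name="facet.field">ckey</str>
        <str name="facet.field">version</str>
      </lst>
    </arr>
  </listener>

The same entry under event="firstSearcher" warms the caches at startup, so the first query after a commit or restart doesn't pay the full price.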

Re: Solr searching performance issues, using large documents (now 1MB documents)

Posted by Peter Spam <ps...@mac.com>.
So, I went through all the effort to break my documents into max 1 MB chunks, and searching for hello still takes over 40 seconds (searching across 7433 documents):

	8 results (41980 ms)

What is going on???  (scroll down for my config).


-Peter
 
On Aug 16, 2010, at 3:59 PM, Markus Jelsma wrote:

> I've no idea if it's possible, but I'd at least try to return an ArrayList of rows instead of just a single row. And if it doesn't work, which is probably the case, how about filing an issue in Jira?
> 
>  
> 
> Reading the docs on the matter, I think it should be possible (or should be made possible) to return multiple rows in an ArrayList.
>  
> -----Original message-----
> From: Peter Spam <ps...@mac.com>
> Sent: Tue 17-08-2010 00:47
> To: solr-user@lucene.apache.org; 
> Subject: Re: Solr searching performance issues, using large documents
> 
> Still stuck on this - any hints on how to write the JavaScript to split a document?  Thanks!
> 
> 
> -Pete
> 
> On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:
> 
>> You may have to write your own javascript to read in the giant field
>> and split it up.
>> 
>> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>>> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.  Any hints? :-)  Thanks!
>>> 
>>> -Peter
>>> 
>>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>> 
>>>> Spanning won't work- you would have to make overlapping mini-documents
>>>> if you want to support this.
>>>> 
>>>> I don't know how big the chunks should be- you'll have to experiment.
>>>> 
>>>> Lance
>>>> 
>>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>>> What would happen if the search query phrase spanned separate document chunks?
>>>>> 
>>>>> Also, what would the optimal size of chunks be?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> 
>>>>> -Peter
>>>>> 
>>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>> 
>>>>>> Not that I know of.
>>>>>> 
>>>>>> The DataImportHandler has the ability to create multiple documents
>>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>>> large log files and splits each one into N documents, with the file
>>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>>> make a DIH file.
>>>>>> 
>>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>> 
>>>>>> From this, you should be able to make a DIH file that puts log files
>>>>>> in as separate documents. As to splitting files up into
>>>>>> mini-documents, you might have to write a bit of Javascript to achieve
>>>>>> this. There is no data structure or software that implements
>>>>>> structured documents.
>>>>>> 
>>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>>> 
>>>>>>> 
>>>>>>> -Peter
>>>>>>> 
>>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>> 
>>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>>>>> 
>>>>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>>>> fast, you have to split up the text into small pieces and only
>>>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>>>> common group id for the document it came from. You might have to do 2
>>>>>>>> queries to achieve what you want, but the second query for the same
>>>>>>>> query will be blindingly fast. Often <1ms.
>>>>>>>> 
>>>>>>>> Good luck!
>>>>>>>> 
>>>>>>>> Lance
>>>>>>>> 
>>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> - Peter
>>>>>>>>> 
>>>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>>> 
>>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>> 
>>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> -Peter
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>>> 
>>>>>>>>>>>> Also regular expression highlighting is more expensive, I think.
>>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>>>>> other wildcard search.
>>>>>>>>>>> 
>>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> - Peter
>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Peter.
>>>>>>>>>>>> 
>>>>>>>>>>>>> [snip: schema, solrconfig, and query details from the original post, quoted in full at the top of this page]
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Lance Norskog
>>>>>>>> goksron@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Lance Norskog
>> goksron@gmail.com
> 


RE: Re: Solr searching performance issues, using large documents

Posted by Markus Jelsma <ma...@buyways.nl>.
I've no idea if it's possible, but I'd at least try to return an ArrayList of rows instead of just a single row. And if it doesn't work, which is probably the case, how about filing an issue in Jira?

 

Reading the docs on the matter, I think it should be possible (or should be made possible) to return multiple rows in an ArrayList.
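
Whether DIH actually fans a returned list out into multiple documents is exactly the open question here, but the experiment would look roughly like this in data-config.xml (an untested sketch; chunk size, field names, and the entity definition are placeholders):

<dataConfig>
  <script><![CDATA[
    function splitBody(row) {
      var rows = new java.util.ArrayList();
      var body = row.get('body');
      var chunk = 1048576;  // ~1 MB per mini-document
      for (var i = 0; i * chunk < body.length(); i++) {
        var piece = new java.util.HashMap();
        piece.put('filename', row.get('filename'));      // common group field
        piece.put('id', row.get('filename') + '_' + i);  // unique id per chunk
        piece.put('body', body.substring(i * chunk, Math.min((i + 1) * chunk, body.length())));
        rows.add(piece);
      }
      return rows;  // one map per mini-document, if DIH accepts a list here
    }
  ]]></script>
  <document>
    <entity name="log" transformer="script:splitBody" ...>
      ...
    </entity>
  </document>
</dataConfig>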
 
-----Original message-----
From: Peter Spam <ps...@mac.com>
Sent: Tue 17-08-2010 00:47
To: solr-user@lucene.apache.org; 
Subject: Re: Solr searching performance issues, using large documents

Still stuck on this - any hints on how to write the JavaScript to split a document?  Thanks!


-Pete

On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:

> You may have to write your own javascript to read in the giant field
> and split it up.
> 
> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.  Any hints? :-)  Thanks!
>> 
>> -Peter
>> 
>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>> 
>>> Spanning won't work- you would have to make overlapping mini-documents
>>> if you want to support this.
>>> 
>>> I don't know how big the chunks should be- you'll have to experiment.
>>> 
>>> Lance
>>> 
>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>> What would happen if the search query phrase spanned separate document chunks?
>>>> 
>>>> Also, what would the optimal size of chunks be?
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>> -Peter
>>>> 
>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>> 
>>>>> Not that I know of.
>>>>> 
>>>>> The DataImportHandler has the ability to create multiple documents
>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>> large log files and splits each one into N documents, with the file
>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>> make a DIH file.
>>>>> 
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>> 
>>>>> From this, you should be able to make a DIH file that puts log files
>>>>> in as separate documents. As to splitting files up into
>>>>> mini-documents, you might have to write a bit of Javascript to achieve
>>>>> this. There is no data structure or software that implements
>>>>> structured documents.
>>>>> 
>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>> 
>>>>>> 
>>>>>> -Peter
>>>>>> 
>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>> 
>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>>>> 
>>>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>>> fast, you have to split up the text into small pieces and only
>>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>>> common group id for the document it came from. You might have to do 2
>>>>>>> queries to achieve what you want, but the second query for the same
>>>>>>> query will be blindingly fast. Often <1ms.
>>>>>>> 
>>>>>>> Good luck!
>>>>>>> 
>>>>>>> Lance
>>>>>>> 
>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> - Peter
>>>>>>>> 
>>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>> 
>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>> 
>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> -Peter
>>>>>>>>> 
>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>> 
>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>> 
>>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>> 
>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>> 
>>>>>>>>>>> Also regular expression highlighting is more expensive, I think.
>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>>>> other wildcard search.
>>>>>>>>>> 
>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>> 
>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - Peter
>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Peter.
>>>>>>>>>>> 
>>>>>>>>>>>> [snip: schema, solrconfig, and query details from the original post, quoted in full at the top of this page]
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
Still stuck on this - any hints on how to write the JavaScript to split a document?  Thanks!


-Pete

On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:

> You may have to write your own javascript to read in the giant field
> and split it up.
> 
> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.  Any hints? :-)  Thanks!
>> 
>> -Peter
>> 
>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>> 
>>> Spanning won't work- you would have to make overlapping mini-documents
>>> if you want to support this.
>>> 
>>> I don't know how big the chunks should be- you'll have to experiment.
>>> 
>>> Lance
>>> 
>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>> What would happen if the search query phrase spanned separate document chunks?
>>>> 
>>>> Also, what would the optimal size of chunks be?
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>> -Peter
>>>> 
>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>> 
>>>>> Not that I know of.
>>>>> 
>>>>> The DataImportHandler has the ability to create multiple documents
>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>> large log files and splits each one into N documents, with the file
>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>> make a DIH file.
>>>>> 
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>> 
>>>>> From this, you should be able to make a DIH file that puts log files
>>>>> in as separate documents. As to splitting files up into
>>>>> mini-documents, you might have to write a bit of Javascript to achieve
>>>>> this. There is no data structure or software that implements
>>>>> structured documents.
>>>>> 
>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>> 
>>>>>> 
>>>>>> -Peter
>>>>>> 
>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>> 
>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>>>> 
>>>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>>> fast, you have to split up the text into small pieces and only
>>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>>> common group id for the document it came from. You might have to do 2
>>>>>>> queries to achieve what you want, but the second query for the same
>>>>>>> query will be blindingly fast. Often <1ms.
>>>>>>> 
>>>>>>> Good luck!
>>>>>>> 
>>>>>>> Lance
>>>>>>> 
>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> - Peter
>>>>>>>> 
>>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>> 
>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>> 
>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> -Peter
>>>>>>>>> 
>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>> 
>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>> 
>>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>> 
>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>> 
>>>>>>>>>>> Also regular expression highlighting is more expensive, I think.
>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>>>> other wildcard search.
>>>>>>>>>> 
>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>> 
>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - Peter
>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Peter.
>>>>>>>>>>> 
>>>>>>>>>>>> [snip: schema, solrconfig, and query details from the original post, quoted in full at the top of this page]
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: Solr searching performance issues, using large documents

Posted by Lance Norskog <go...@gmail.com>.
You may have to write your own javascript to read in the giant field
and split it up.

On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.  Any hints? :-)  Thanks!
>
> -Peter
>
> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>
>> Spanning won't work- you would have to make overlapping mini-documents
>> if you want to support this.
>>
>> I don't know how big the chunks should be- you'll have to experiment.
>>
>> Lance
>>
>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>> What would happen if the search query phrase spanned separate document chunks?
>>>
>>> Also, what would the optimal size of chunks be?
>>>
>>> Thanks!
>>>
>>>
>>> -Peter
>>>
>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>
>>>> Not that I know of.
>>>>
>>>> The DataImportHandler has the ability to create multiple documents
>>>> from one input stream. It is possible to create a DIH file that reads
>>>> large log files and splits each one into N documents, with the file
>>>> name as a common field. The DIH wiki page tells you in general how to
>>>> make a DIH file.
>>>>
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>> From this, you should be able to make a DIH file that puts log files
>>>> in as separate documents. As to splitting files up into
>>>> mini-documents, you might have to write a bit of Javascript to achieve
>>>> this. There is no data structure or software that implements
>>>> structured documents.
>>>>
>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>
>>>>>
>>>>> -Peter
>>>>>
>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>
>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>>>
>>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>> fast, you have to split up the text into small pieces and only
>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>> common group id for the document it came from. You might have to do 2
>>>>>> queries to achieve what you want, but the second query for the same
>>>>>> query will be blindingly fast. Often <1ms.
>>>>>>
>>>>>> Good luck!
>>>>>>
>>>>>> Lance
>>>>>>
>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>> Thanks!
>>>>>>>
>>>>>>> - Peter
>>>>>>>
>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>
>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>
>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -Peter
>>>>>>>>
>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>
>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>
>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>
>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>
>>>>>>>>>> Also regular expression highlighting is more expensive, I think.
>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>>> other wildcard search.
>>>>>>>>>
>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>
>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Peter
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Peter.
>>>>>>>>>>
>>>>>>>>>>> [snip: schema, solrconfig, and query details from the original post, quoted in full at the top of this page]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.  Any hints? :-)  Thanks!

-Peter

On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:

> Spanning won't work- you would have to make overlapping mini-documents
> if you want to support this.
> 
> I don't know how big the chunks should be- you'll have to experiment.
> 
> Lance
> 
> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>> What would happen if the search query phrase spanned separate document chunks?
>> 
>> Also, what would the optimal size of chunks be?
>> 
>> Thanks!
>> 
>> 
>> -Peter
>> 
>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>> 
>>> Not that I know of.
>>> 
>>> The DataImportHandler has the ability to create multiple documents
>>> from one input stream. It is possible to create a DIH file that reads
>>> large log files and splits each one into N documents, with the file
>>> name as a common field. The DIH wiki page tells you in general how to
>>> make a DIH file.
>>> 
>>> http://wiki.apache.org/solr/DataImportHandler
>>> 
>>> From this, you should be able to make a DIH file that puts log files
>>> in as separate documents. As to splitting files up into
>>> mini-documents, you might have to write a bit of Javascript to achieve
>>> this. There is no data structure or software that implements
>>> structured documents.
>>> 
>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>> 
>>>> 
>>>> -Peter
>>>> 
>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>> 
>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>> 
>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>> fast, you have to split up the text into small pieces and only
>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>> common group id for the document it came from. You might have to do 2
>>>>> queries to achieve what you want, but the second query for the same
>>>>> query will be blindingly fast. Often <1ms.
>>>>> 
>>>>> Good luck!
>>>>> 
>>>>> Lance
>>>>> 
>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>> Thanks!
>>>>>> 
>>>>>> - Peter
>>>>>> 
>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>> 
>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>> 
>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> -Peter
>>>>>>> 
>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>> 
>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>> 
>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>> 
>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>> 
>>>>>>>>> Also regular expression highlighting is more expensive, I think.
>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>> other wildcard search.
>>>>>>>> 
>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>> 
>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> - Peter
>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Peter.
>>>>>>>>> 
>>>>>>>>>> [snip: schema, solrconfig, and query details from the original post, quoted in full at the top of this page]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: Solr searching performance issues, using large documents

Posted by Lance Norskog <go...@gmail.com>.
Spanning won't work- you would have to make overlapping mini-documents
if you want to support this.

I don't know how big the chunks should be- you'll have to experiment.

Lance
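
To make the overlap concrete, a Ruby sketch of the chunking (all sizes here are guesses to experiment with; body holds the raw log text):

CHUNK   = 1_000_000  # ~1 MB per mini-document
OVERLAP = 1_000      # repeat the tail of each chunk so a phrase can cross a boundary
chunks = []
pos = 0
while pos < body.length
  chunks << body[pos, CHUNK]
  pos += CHUNK - OVERLAP
end
# Index each element of chunks as its own document, with the
# filename as a common group field.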

On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
> What would happen if the search query phrase spanned separate document chunks?
>
> Also, what would the optimal size of chunks be?
>
> Thanks!
>
>
> -Peter



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
What would happen if the search query phrase spanned separate document chunks?

Also, what would the optimal size of chunks be?

Thanks!


-Peter

On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:

> Not that I know of.
> 
> The DataImportHandler has the ability to create multiple documents
> from one input stream. It is possible to create a DIH file that reads
> large log files and splits each one into N documents, with the file
> name as a common field. The DIH wiki page tells you in general how to
> make a DIH file.
> 
> http://wiki.apache.org/solr/DataImportHandler
> 
> From this, you should be able to make a DIH file that puts log files
> in as separate documents. As to splitting files up into
> mini-documents, you might have to write a bit of Javascript to achieve
> this. There is no data structure or software that implements
> structured documents.
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: Solr searching performance issues, using large documents

Posted by Lance Norskog <go...@gmail.com>.
Not that I know of.

The DataImportHandler has the ability to create multiple documents
from one input stream. It is possible to create a DIH file that reads
large log files and splits each one into N documents, with the file
name as a common field. The DIH wiki page tells you in general how to
make a DIH file.

http://wiki.apache.org/solr/DataImportHandler

From this, you should be able to make a DIH file that puts log files
in as separate documents. As to splitting files up into
mini-documents, you might have to write a bit of Javascript to achieve
this. There is no data structure or software that implements
structured documents.
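
As a rough illustration of the same splitting idea without a DIH
config, here is a hedged Ruby sketch that chunks a file client-side and
posts each chunk to Solr's XML update handler (the URL, chunk size, and
id scheme are assumptions; id, filename, and body are fields from the
schema earlier in the thread):

require 'net/http'
require 'uri'
require 'cgi'

def index_in_chunks(path, chunk_size = 100_000)
  text = File.read(path)
  uri = URI.parse('http://localhost:8983/solr/update')  # assumed URL
  Net::HTTP.start(uri.host, uri.port) do |http|
    (0...text.length).step(chunk_size).each_with_index do |pos, i|
      # One mini-document per chunk; the file name is the common field.
      doc = '<add><doc>' +
            '<field name="id">' + CGI.escapeHTML("#{path}-#{i}") + '</field>' +
            '<field name="filename">' + CGI.escapeHTML(path) + '</field>' +
            '<field name="body">' + CGI.escapeHTML(text[pos, chunk_size]) + '</field>' +
            '</doc></add>'
      http.post(uri.path, doc, 'Content-Type' => 'text/xml')
    end
    http.post(uri.path, '<commit/>', 'Content-Type' => 'text/xml')
  end
end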

On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>
>
> -Peter



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
Thanks for the pointer, Lance!  Is there an example of this somewhere?


-Peter

On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:

> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
> 
> Highlighting does not stream- it pulls the entire stored contents into
> one string and then pulls out the snippet.  If you want this to be
> fast, you have to split up the text into small pieces and only
> snippetize from the most relevant text. So, separate documents with a
> common group id for the document it came from. You might have to do 2
> queries to achieve what you want, but the second query for the same
> query will be blindingly fast. Often <1ms.
> 
> Good luck!
> 
> Lance


Re: Solr searching performance issues, using large documents

Posted by Lance Norskog <go...@gmail.com>.
Ah! You're not just highlighting, you're snippetizing. This makes it easier.

Highlighting does not stream - it pulls the entire stored contents into
one string and then pulls out the snippet. If you want this to be
fast, you have to split up the text into small pieces and only
snippetize from the most relevant text. So: separate documents, with a
common group id tying each chunk back to the file it came from. You
might have to do two queries to achieve what you want, but the second
query for the same terms will be blindingly fast - often under 1 ms.
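
A hedged Ruby sketch of that two-query flow, in the style of the
query-building code earlier in the thread (group_id is a hypothetical
field naming the source file of each chunk; the values are placeholders):

require 'cgi'

terms = 'kernel panic'       # example search terms
best_group = 'console.log'   # group id of the top file from query 1

# Query 1: find the matching files cheaply - no highlighting.
q1 = '/solr/select?wt=ruby&rows=10&hl=false&q=' +
     CGI::escape('body:(' + terms + ')')

# Query 2: highlight only within that file's small chunk documents,
# restricted via an fq on the hypothetical group_id field. The chunks
# are small, so snippet extraction here is cheap.
q2 = '/solr/select?wt=ruby&rows=3&hl=true&hl.fl=body&hl.fragsize=400' +
     '&fq=' + CGI::escape('group_id:"' + best_group + '"') +
     '&q=' + CGI::escape('body:(' + terms + ')')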

Good luck!

Lance

On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
> Thanks!
>
> - Peter
>
> ps. sorry for the many responses - I'm rushing around trying to get this working.



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
Thanks!

- Peter

ps. sorry for the many responses - I'm rushing around trying to get this working.

On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:

> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
> Thanks!
> 
> -Peter


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
Thanks!

-Peter

On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:

> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
> 
>> did you already try other values for hl.maxAnalyzedChars=2147483647
> 
> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
> 
>> ? Also regular expression highlighting is more expensive, I think.
>> What does the 'fuzzy' variable mean? If you use this to query via
>> "~someTerm" instead "someTerm"
>> then you should try the trunk of solr which is a lot faster for fuzzy or
>> other wildcard search.
> 
> "fuzzy" could be set to "*" but isn't right now.
> 
> Thanks for the tips, Peter - this has been very frustrating!
> 
> 
> - Peter


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:

> did you already try other values for hl.maxAnalyzedChars=2147483647

Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).

> ? Also regular expression highlighting is more expensive, I think.
> What does the 'fuzzy' variable mean? If you use this to query via
> "~someTerm" instead "someTerm"
> then you should try the trunk of solr which is a lot faster for fuzzy or
> other wildcard search.

"fuzzy" could be set to "*" but isn't right now.

Thanks for the tips, Peter - this has been very frustrating!


- Peter

> Regards,
> Peter.
> 
> -- 
> http://karussell.wordpress.com/
> 


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
On Jul 30, 2010, at 7:04 PM, Lance Norskog wrote:

> Wait- how much text are you highlighting? You say these logfiles are X
> big- how big are the actual documents you are storing?

I want it to be like Google - I put the entire (sometimes 60MB) doc in a field, and then just highlight 2-4 lines of it.


Thanks,
Peter




Re: Solr searching performance issues, using large documents

Posted by Lance Norskog <go...@gmail.com>.
Wait - how much text are you highlighting? You say these log files are X
big - how big are the actual documents you are storing?



On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich <pe...@yahoo.de> wrote:
> Hi Peter :-),
>
> did you already try other values for
>
> hl.maxAnalyzedChars=2147483647
>
> ? Also regular expression highlighting is more expensive, I think.
> What does the 'fuzzy' variable mean? If you use this to query via
> "~someTerm" instead "someTerm"
> then you should try the trunk of solr which is a lot faster for fuzzy or
> other wildcard search.
>
> Regards,
> Peter.
>
> --
> http://karussell.wordpress.com/
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solr searching performance issues, using large documents

Posted by Peter Karich <pe...@yahoo.de>.
Hi Peter :-),

did you already try other values for

hl.maxAnalyzedChars=2147483647

? Also, regular expression highlighting is more expensive, I think.
What does the 'fuzzy' variable mean? If you use it to query via
"~someTerm" instead of "someTerm", you should try the trunk of Solr,
which is a lot faster for fuzzy and other wildcard searches.
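
For example, you could cap the analyzed window in the query-building
script from the first post. The 100000 figure is only an illustrative
value to experiment with, not a recommendation:

hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
           "&hl.regex.slop=1&hl.fragmenter=regex" +
           "&hl.regex.maxAnalyzedChars=100000" +  # was 2147483647
           "&hl.maxAnalyzedChars=100000"          # analyze only the first ~100k chars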

Regards,
Peter.
 
> [...]


-- 
http://karussell.wordpress.com/


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
I do store term vectors:

<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

> Highlighting time is mainly spent on fetching the field you want to
> highlight and tokenizing it (if you don't store term vectors).
> You can check where the time goes.
> [...]


Re: Solr searching performance issues, using large documents

Posted by Li Li <fa...@gmail.com>.
Highlighting time is mainly spent on fetching the field you want to
highlight and tokenizing it (if you don't store term vectors).
You can check where the time goes.
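
One way to see where the time goes is Solr's per-component timing. A
small sketch against the request URL built earlier in the thread;
debugQuery adds a "timing" block to the debug section of the response:

# Re-issue the same query with debugQuery=true; the debug section of
# the response then reports time spent per component (query, facet,
# highlight), which shows whether highlighting dominates.
thequery_dbg = thequery + '&debugQuery=true'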

2010/7/30 Peter Spam <ps...@mac.com>:
> If I don't do highlighting, it's really fast.  Optimize has no effect.
>
> -Peter
> [...]
>

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
If I don't do highlighting, it's really fast.  Optimize has no effect.

-Peter

On Jul 29, 2010, at 11:54 AM, dc tech wrote:

> Are you storing the entire log file text in Solr? That's almost 3GB of
> text. A few things to try:
> 1) Is this first-time performance, or repeat queries with the same fields?
> 2) Optimize the index and test performance again.
> 3) Index without storing the text and see what the performance looks like.
> [...]


Re: Solr searching performance issues, using large documents

Posted by dc tech <dc...@gmail.com>.
Are you storing the entire log file text in Solr? That's almost 3GB of
text. A few things to try:
1) Is this first-time performance, or repeat queries with the same fields?
2) Optimize the index and test performance again (a minimal call is sketched below).
3) Index without storing the text and see what the performance looks like.
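
A minimal way to trigger the optimize from Ruby, assuming the default
localhost:8983 single-core setup used in this thread (adjust host,
port, and path for your install):

require 'net/http'

# Posting <optimize/> to the update handler merges the index down to
# one segment, which can help repeat-query performance.
http = Net::HTTP.new('localhost', 8983)
resp = http.post('/solr/update', '<optimize/>', 'Content-Type' => 'text/xml')
puts resp.body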


On 7/29/10, Peter Spam <ps...@mac.com> wrote:
> Any ideas?  I've got 5000 documents with an average size of 850k each, and
> it sometimes takes 2 minutes for a query to come back when highlighting is
> turned on!  Help!
> [...]

-- 
Sent from my mobile device

Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
Any ideas?  I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on!  Help!


-Pete

On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

> [...]


Re: Solr searching performance issues, using large documents

Posted by Peter Spam <ps...@mac.com>.
From the mailing list archive, Koji wrote:

> 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field.

and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

> If you want to highlight field X, doing the termOffsets/termPositions/termVectors will make highlighting that field faster. You should make a separate field and apply these options to that field.
> 
> Now: doing a copyfield adds a "value" to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued="false" to that field, just to avoid mistakes.
> 
> So, all_text should be indexed without the term* attributes, and should not be stored. Then your document stored in a separate field that you use for highlighting and has the term* attributes.

I've been experimenting with this, and here's what I've tried:

   <field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOff
sets="true" />
   <field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
   <copyField source="body" dest="body_all"/>

... but it's still very slow (10+ seconds).  Why is it better to have two fields (one indexed but not stored, and the other not indexed but stored) rather than just one field that's both indexed and stored?
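
For comparison, here is a sketch of the arrangement Lance describes: a
search-only field plus a separate stored field that carries the term*
attributes for highlighting. The field names are illustrative, not from
the thread:

   <!-- search field: indexed, not stored, no term vectors -->
   <field name="body" type="text_pl" indexed="true" stored="false" multiValued="false" />
   <!-- highlighting field: stored, single-valued, with the term* attributes -->
   <field name="body_hl" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
   <copyField source="body" dest="body_hl"/>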


From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

> If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used.

What does this mean?  How do you load a field lazily?
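
(For reference: lazy loading is a stock solrconfig.xml switch that
lives in the <query> section of the example config; it makes Solr read
stored fields from disk only when a request actually asks for them.)

   <!-- solrconfig.xml, inside <query> -->
   <enableLazyFieldLoading>true</enableLazyFieldLoading>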

Thanks for your time, guys - this has started to become frustrating, since it works so well, but is very slow!


-Pete

On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

> [...]