Posted to solr-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/12/01 03:38:46 UTC
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Hi Fergie,
Haven't forgotten about you, but I've been traveling and then into
some US Holidays here.
To confirm I am understanding, you are seeing a slowdown between 1.3-
dev from April and one from September, right?
Can you produce an MD5 hash of the WAR file or something, such that I
can know I have the exact bits? Better yet, perhaps you can put those
files up somewhere where they can be downloaded.
Thanks,
Grant
On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:
> Hello Grant,
>
> Not much good with Java profilers (yet!) so I thought I
> would send a script!
>
> Details... details! Having decided to produce a script to
> replicate the 1.2 vs 1.3 speed problem, I found the required
> rigor revealed a lot more.
>
> 1) The faster version I have previously referred to as 1.2
> was actually a "1.3-dev" I had downloaded as part of the
> solr bootcamp class at ApacheCon Europe 2008. The ID
> string in the CHANGES.txt document is:-
> $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>
> 2) I did actually download and speed test a version of 1.2
> from the internet. Its CHANGES.txt id is:-
> $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
> Speed-wise it was about the same as 1.3 at 64min. It also
> had lots of char set issues and is ignored from now on.
>
> 3) The version I was planning to use, till I found this
> speed issue, was the "latest" official version:-
> $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
> I also verified the behavior with a nightly build.
> $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>
> Anyway, the following script indexes the content in 22min
> for the 1.3-dev version and takes 68min for the newer releases
> of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
> release and used it to replace the conf directory from the
> official 1.3 release. The 3x slowdown was still there; it is
> not a configuration issue!
> =================================
>
>
>
>
>
>
> #! /bin/bash
>
> # This script assumes a /usr/local/tomcat link to whatever version
> # of tomcat you have installed. I have "apache-tomcat-5.5.20". Also,
> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
> # All the following was done as root.
>
>
> # I have a directory /usr/local/ts which contains four versions of solr:
> # the "official" 1.2, two 1.3 releases, and a version of 1.2 or a
> # 1.3 beta I got while attending a solr bootcamp. I indexed the same
> # content using the different versions of solr as follows:
> cd /usr/local/ts
> if [ "" ]  # one-off setup; make this test string non-empty to re-run it
> then
> echo "Starting afresh"
> sleep 5 # allow time for me to interrupt!
> cp -Rp apache-solr-bc/example/solr ./solrbc #bc = bootcamp
> cp -Rp apache-solr-nightly/example/solr ./solrnightly
> cp -Rp apache-solr-1.3.0/example/solr ./solr13
>
> # the gaz is regularly updated and its name keeps changing :-) The page
> # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the
> # latest version.
> curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
> unzip -q geonames.zip
> # delete corrupt blips!
> perl -i -n -e 'print unless
> ($. > 2128495 and $. < 2128505) or
> ($. > 5944254 and $. < 5944260)
> ;' geonames_dd_dms_date_20081118.txt
> # following was used to detect bad short records:
> #perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
>
> # my set of fields and copyfields for the schema.xml
> fields='
> <fields>
>   <field name="UNI" type="string" indexed="true" stored="true" required="true"/>
>   <field name="CCODE" type="string" indexed="true" stored="true"/>
>   <field name="DSG" type="string" indexed="true" stored="true"/>
>   <field name="CC1" type="string" indexed="true" stored="true"/>
>   <field name="LAT" type="sfloat" indexed="true" stored="true"/>
>   <field name="LONG" type="sfloat" indexed="true" stored="true"/>
>   <field name="MGRS" type="string" indexed="false" stored="true"/>
>   <field name="JOG" type="string" indexed="false" stored="true"/>
>   <field name="FULL_NAME" type="string" indexed="true" stored="true"/>
>   <field name="FULL_NAME_ND" type="string" indexed="true" stored="true"/>
>   <!--field name="text" type="text" indexed="true" stored="false" multiValued="true"/-->
>   <!--field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/-->
> '
> copyfields='
> </fields>
> <copyField source="FULL_NAME" dest="text"/>
> <copyField source="FULL_NAME_ND" dest="text"/>
> '
>
> # add in my fields and copyfields
> perl -i -p -e "print qq($fields) if s/<fields>//;" solr*/conf/schema.xml
> perl -i -p -e "print qq($copyfields) if s[</fields>][];" solr*/conf/schema.xml
> # change the unique key and mark the "id" field as not required
> perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;" solr*/conf/schema.xml
> perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
> # enable remote streaming in solrconfig file
> perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
> fi
>
> # some constants to keep the curl command shorter
> skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
> file=`pwd`"/geonames_dd_dms_date_20081118.txt"
>
> export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"
>
> echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
> then
> echo "Tomcat would not shutdown"
> exit
> fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrbc solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
> echo "Getting ready to index the data set using solrnightly"
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
> then
> echo "Tomcat would not shutdown"
> exit
> fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrnightly solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrnightly"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
>
>
>
>> On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>>
>>> Hello Grant,
>>>
>>>> Were you overwriting the existing index or did you also clean out the
>>>> Solr data directory, too? In other words, was it a fresh index, or an
>>>> existing one? And was that also the case for the 22 minute time?
>>>
>>> No, in each case it was a new index. I store the indexes (the "data"
>>> dir) outside the solr home directory. For the moment I rm -rf the index
>>> dir after each edit to the solrconfig.xml or schema.xml file and reindex
>>> from scratch. The relaunch of tomcat recreates the index dir.
>>>
>>>> Would it be possible to profile the two instances and see if you
>>>> notice anything different?
>>> I don't understand this. Do you mean run a profiler against the tomcat
>>> image as indexing takes place, or somehow compare the indexes?
>>
>> Something like JProfiler or any other Java profiler.
>>
>>>
>>>
>>> I was thinking of making a short script that replicates the results
>>> and posting it here; would that help?
>>
>>
>> Very much so.
>>
>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a CSV file with 6M records which took 22min to index with
>>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>>
>>>>> Indexing the exact same content now takes 69min. My machine has
>>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>>
>>>>> Are there any tweaks I can use to get the original index time
>>>>> back? I read through the release notes and was expecting a
>>>>> speed-up. I saw the bit about increasing ramBufferSizeMB and set
>>>>> it to 64MB; it had no effect.
>>>>> --
>
> --
>
> ===============================================================
> Fergus McMenemie Email:fergus@twig.me.uk
> Techmore Ltd Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets Analyst Programmer
> ===============================================================
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Posted by Fergus McMenemie <fe...@twig.me.uk>.
Hello Grant,
>
>Haven't forgotten about you, but I've been traveling and then into
>some US Holidays here.
Happy Thanksgiving!
>
>To confirm I am understanding, you are seeing a slowdown between 1.3-
>dev from April and one from September, right?
Yep.
Here are the MD5 hashes:-
fergus: md5 *.war
MD5 (solr-bc.war) = 8d4f95628d6978c959d63d304788bc25
MD5 (solr-nightly.war) = 10281455a66b0035ee1f805496d880da
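(A side note for anyone comparing on Linux: the hashes above come from BSD md5, which prints `MD5 (file) = hash`, whereas GNU md5sum prints the hash first, then the filename, so the bare hash has to be cut out before comparing. A quick sketch of that, demonstrated on the RFC 1321 test string "abc" rather than the WAR files themselves:)

```shell
# GNU md5sum prints "<hash>  <file>"; awk isolates the hash so it can
# be compared against a value printed by BSD md5. Shown on a known
# input, not the WARs.
hash=$(printf 'abc' | md5sum | awk '{print $1}')
echo "$hash"   # 900150983cd24fb0d6963f7d28e17f72
```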
This is the META-INF/MANIFEST.MF from a recent nightly build (slow):
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.0
Created-By: 1.5.0_06-b05 (Sun Microsystems Inc.)
Extension-Name: org.apache.solr
Specification-Title: Apache Solr Search Server
Specification-Version: 1.3.0.2008.11.13.08.16.12
Specification-Vendor: The Apache Software Foundation
Implementation-Title: org.apache.solr
Implementation-Version: nightly exported - yonik - 2008-11-13 08:16:12
Implementation-Vendor: The Apache Software Foundation
X-Compile-Source-JDK: 1.5
X-Compile-Target-JDK: 1.5
This is the manifest from the WAR file we were given on the course (fast):
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.0
Created-By: 1.5.0_13-121 ("Apple Computer, Inc.")
Extension-Name: org.apache.solr
Specification-Title: Apache Solr Search Server
Specification-Version: 1.2.2008.04.04.08.09.14
Specification-Vendor: The Apache Software Foundation
Implementation-Title: org.apache.solr
Implementation-Version: 1.3-dev exported - erik - 2008-04-04 08:09:14
Implementation-Vendor: The Apache Software Foundation
X-Compile-Source-JDK: 1.5
X-Compile-Target-JDK: 1.5
I have copied both WAR files to a web site:
http://www.twig.me.uk/solr/solr-bc.war (solr 1.3 dev == bootcamp)
http://www.twig.me.uk/solr/solr-nightly.war (nightly)
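(A sketch of how the downloads could be checked against the hashes posted above, assuming GNU md5sum; its -c mode reads lines of the form "hash  filename", with two spaces between. The curl/WAR lines are hypothetical and commented out; the mechanism is demonstrated on a local file instead:)

```shell
# Hypothetical fetch-and-verify for the WARs above.
# curl -sO http://www.twig.me.uk/solr/solr-bc.war
# printf '%s  solr-bc.war\n' 8d4f95628d6978c959d63d304788bc25 | md5sum -c -
# Demonstrate the -c check itself on a throwaway file:
printf 'abc' > demo.txt
printf '%s  demo.txt\n' 900150983cd24fb0d6963f7d28e17f72 | md5sum -c -   # prints "demo.txt: OK"
rm demo.txt
```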
Regards Fergus.
>Can you produce an MD5 hash of the WAR file or something, such that I
>can know I have the exact bits. Better yet, perhaps you can put those
>files up somewhere where they can be downloaded.
>
>Thanks,
>Grant
>
>[... quoted copy of the Nov 26 message and script snipped; it appears in full earlier in the thread ...]
--
===============================================================
Fergus McMenemie Email:fergus@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================