Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2011/01/03 17:58:46 UTC

[jira] Created: (LUCENE-2844) benchmark geospatial performance based on geonames.org

benchmark geospatial performance based on geonames.org
------------------------------------------------------

                 Key: LUCENE-2844
                 URL: https://issues.apache.org/jira/browse/LUCENE-2844
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/benchmark
            Reporter: David Smiley
            Priority: Minor
             Fix For: 4.0


Until now, the benchmark contrib module did not include a means to test geospatial data; this patch adds one.  The patch includes some new files and changes to existing ones.  Here is a summary of what it adds, per file (all files below are within the benchmark contrib module), along with my notes:

Changes:
* build.xml -- Add dependency on Lucene's spatial module and Solr.
** It was a real pain to figure out the convoluted ant build system to make this work, and I doubt I did it the proper way.  
** Rob Muir thought it would be a good idea to make the benchmark contrib module a top-level module (i.e. alongside analysis) so that it can depend on everything.  http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2157824.html  I agree.
* ReadTask.java -- Added a search.useHitTotal boolean option that will use the total hits number for reporting purposes, instead of the existing behavior.
** The existing behavior (i.e. when search.useHitTotal=false) doesn't look very useful since the response integer is the sum of several things instead of just one thing.  I don't see how anyone makes use of it.

Note that on my local system, I also changed ReportTask & RepSelectByPrefTask to not include the '-' on every other line, and changed Format.java to not use commas in the numbers.  These changes make copy-pasting into Excel more streamlined.

New Files:
* geoname-spatial.alg -- my algorithm file.
** Note the ":0" trailing the Populate sequence.  This is a trick I use to skip building the index, since it takes a while to build and I'm not interested in benchmarking index construction.  Set this to :1 for the first run so the index gets built, then put it back to :0 for further runs, as long as doc.geo.schemaField and any other configuration elements affecting the index stay the same.
** In the patch, doc.geo.schemaField=geohash, but unless you're tinkering with SOLR-2155, you'll probably want to set this to "latlon".
* GeoNamesContentSource.java -- a ContentSource for a geonames.org data file (either a single country like US.txt or allCountries.txt).
** Uses a subclass of DocData to store all the fields.  The existing DocData wasn't very applicable to data that is not composed of a title and body.
** Doesn't reuse the docdata parameter to getNextDocData(); a new one is created every time.
** Only supports content.source.forever=false
* GeoNamesDocMaker.java -- a subclass of DocMaker that works very differently than the existing DocMaker.
** Instead of assuming that each line from geonames.org will correspond to one Lucene document, this implementation supports, via configuration, creating a variable number of documents, each with a variable number of points taken randomly from a GeoNamesContentSource.
** doc.geo.docsToGenerate:  The number of documents to generate.  If blank it defaults to the number of rows in GeoNamesContentSource.
** doc.geo.avgPlacesPerDoc: The average number of places to be added to a document.  A random number between 0 and one less than twice this amount is chosen on a per document basis.  If this is set to 1, then exactly one is always used.  In order to support a value greater than 1, use the geohash field type and incorporate SOLR-2155 (geohash prefix technique).
** doc.geo.oneDocPerPlace: Whether at most one document should use the same place.  In other words, can more than one document have the same place?  If so, set this to false.
** doc.geo.schemaField: references a field name in schema.xml.  The field should implement SpatialQueryable.
* GeoPerfData.java: This class is a singleton storing data in memory that is shared by GeoNamesDocMaker.java and GeoQueryMaker.java.
** content.geo.zeroPopSubst: if a population is encountered that is <= 0, then use this population value instead.  Default is 100.
** content.geo.maxPlaces: A limit on the number of rows read in from GeoNamesContentSource.java can be set here.  Defaults to Integer.MAX_VALUE.
** GeoPerfData is primarily responsible for reading in data from GeoNamesContentSource into memory to store the lat, lon, and population.  When a random place is asked for, you get one weighted according to population.  The idea is to skew the data towards more referenced places, and a population number is a decent way of doing it.
* GeoQueryMaker.java -- returns random queries from GeoPerfData by taking a random point and using a particular configured radius.  A pure lat-lon bounding box query is ultimately done.
** query.geo.radiuskm: The radius of the query in kilometers.
* schema.xml -- a Solr schema file to configure SpatialQueryable fields referenced by doc.geo.schemaField.
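
To illustrate the population-weighted selection described for GeoPerfData, here is a minimal sketch.  It is not the code from the patch: the class name WeightedPlaces, its methods, and the sample coordinates are all made up for illustration, and the cumulative-sum-plus-binary-search technique is just one common way to do weighted sampling.  The zeroPopSubst constructor argument plays the role of content.geo.zeroPopSubst.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical illustration only; this is not the GeoPerfData class from the patch. */
final class WeightedPlaces {
    private final double[] lat;
    private final double[] lon;
    /** cumulativePop[i] = sum of the (substituted) populations of places 0..i. */
    private final long[] cumulativePop;

    WeightedPlaces(double[] lat, double[] lon, long[] population, long zeroPopSubst) {
        if (lat.length == 0 || lat.length != lon.length || lat.length != population.length) {
            throw new IllegalArgumentException("need equally sized, non-empty arrays");
        }
        if (zeroPopSubst <= 0) {
            throw new IllegalArgumentException("zeroPopSubst must be positive");
        }
        this.lat = lat.clone();
        this.lon = lon.clone();
        this.cumulativePop = new long[population.length];
        long total = 0;
        for (int i = 0; i < population.length; i++) {
            // Mirrors content.geo.zeroPopSubst: populations <= 0 get a substitute weight.
            long weight = population[i] <= 0 ? zeroPopSubst : population[i];
            total += weight;
            cumulativePop[i] = total;
        }
    }

    long totalPopulation() {
        return cumulativePop[cumulativePop.length - 1];
    }

    /** Maps a ticket in [0, totalPopulation()) to the index of the place that owns it. */
    int indexForTicket(long ticket) {
        int pos = Arrays.binarySearch(cumulativePop, ticket);
        // An exact hit on cumulativePop[i] is the first ticket of place i + 1;
        // a miss returns -(insertionPoint) - 1, and the insertion point is the owner.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Picks a place index with probability proportional to its weight. */
    int randomIndex(Random rnd) {
        // floorMod keeps the ticket in range; its slight modulo bias is ignored here.
        long ticket = Math.floorMod(rnd.nextLong(), totalPopulation());
        return indexForTicket(ticket);
    }

    double lat(int index) { return lat[index]; }

    double lon(int index) { return lon[index]; }

    public static void main(String[] args) {
        // Made-up sample values: three places, the last with an unknown (0) population.
        WeightedPlaces places = new WeightedPlaces(
            new double[] {40.71, 44.26, 64.84},
            new double[] {-74.01, -72.58, -147.72},
            new long[] {8000000L, 8000L, 0L},
            100L);
        Random rnd = new Random(42);
        for (int i = 0; i < 5; i++) {
            int idx = places.randomIndex(rnd);
            System.out.println(places.lat(idx) + "," + places.lon(idx));
        }
    }
}
```

Each place owns a range of "tickets" as wide as its population, so a uniformly drawn ticket lands on a populous place more often.  A place whose population is <= 0 still gets a small range via the substitute value, so it can be selected occasionally instead of never.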

When I run this algorithm as provided with the file in the patch, I get this result:
{noformat}
Operation   round ____km   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
Search_40       0    350        1      4811687 1,206,541.38        3.99   117,722,664    191,934,464
{noformat}

The key metrics I use are the average milliseconds per query and the average places per query.  The number of queries performed is the trailing numeric suffix of the Operation name (40 in Search_40).  The formulas:
* avg ms/query:  elapsedSec*1000/queries  == 99.75 (3.99*1000/40, using the rounded elapsedSec shown above)
* avg places / query:  recsPerRun/queries == 120,292
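
As a worked example of the two formulas, the sketch below plugs in the values from the report row above.  The class and method names are hypothetical and are not part of the benchmark module.  With the rounded elapsedSec of 3.99 printed in the table, the first formula evaluates to about 99.75 ms, and the second to about 120,292 places.

```java
/** Hypothetical helper for the two formulas; not part of the benchmark module. */
final class BenchmarkMetrics {
    /** Parses the query count from an operation name such as "Search_40". */
    static int queryCount(String operation) {
        int idx = operation.lastIndexOf('_');
        if (idx < 0) {
            throw new IllegalArgumentException("no numeric suffix in: " + operation);
        }
        return Integer.parseInt(operation.substring(idx + 1));
    }

    /** avg ms/query = elapsedSec * 1000 / queries */
    static double avgMsPerQuery(double elapsedSec, int queries) {
        return elapsedSec * 1000.0 / queries;
    }

    /** avg places/query = recsPerRun / queries */
    static double avgPlacesPerQuery(long recsPerRun, int queries) {
        return (double) recsPerRun / queries;
    }

    public static void main(String[] args) {
        // Values copied from the Search_40 row of the report.
        int queries = queryCount("Search_40");
        System.out.printf("avg ms/query: %.2f%n", avgMsPerQuery(3.99, queries));
        System.out.printf("avg places/query: %.0f%n", avgPlacesPerQuery(4811687L, queries));
    }
}
```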

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2844) benchmark geospatial performance based on geonames.org

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976823#action_12976823 ] 

Robert Muir commented on LUCENE-2844:
-------------------------------------

David, I'll first create an issue to propose moving benchmark/ to modules.

I've personally been frustrated by this before (just simple stuff like wanting to benchmark some analysis 
definition in a schema.xml for ReadTokens/indexing speed and having to actually write an Analyzer.java to do it).




[jira] Updated: (LUCENE-2844) benchmark geospatial performance based on geonames.org

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-2844:
---------------------------------

    Attachment: benchmark-geo.patch

This is an update to the patch which accounts for the move of the benchmark contrib to /modules/benchmark.  It also includes GeoNamesSetSolrAnalyzerTask, which will use Solr's field-specific analyzer.  It's very much tied to this set of classes in the patch.  There are ASF headers now too.



[jira] Updated: (LUCENE-2844) benchmark geospatial performance based on geonames.org

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-2844:
---------------------------------

    Attachment: benchmark-geo.patch
