Posted to dev@vxquery.apache.org by Eldon Carman <ec...@ucr.edu> on 2013/11/17 21:33:34 UTC

XML Benchmark Test Scenarios and Queries

The goal of the benchmark tests is to highlight the parallel aspects of
VXQuery. The tests need to show how VXQuery scales. In addition, other
queries may be added to highlight our specific speed improvements or where
improvements can still be made. At first we want to show how the system
works with parallel queries. We focus on three types of queries: filtering,
aggregation and nested loops (join).

For these three queries the following scaling tests will be completed:
scale up and speed up.
 * Scale up keeps the number of nodes in the cluster constant and increases
the data set in each successive test.
 * Speed up keeps the data set size constant and increases the number of
nodes processing the data in each successive test.
(http://en.wikipedia.org/wiki/Speedup)

Still working on the specific queries for our GHCN daily data, but you can
see the draft version here:
https://svn.apache.org/repos/asf/incubator/vxquery/trunk/vxquery/vxquery-benchmark/src/main/resources/noaa-ghcn-daily/queries/

Re: XML Benchmark Test Scenarios and Queries

Posted by Michael Carey <mj...@ics.uci.edu>.

A minor nit:  This definition of "speed up" is the correct/common use - 
you increase the number of nodes and hope to see the system get 
proportionally faster.  E.g., 10x the iron makes your query run 10x 
faster if you have "perfect speed up".  However, this is not a 
correct/common use of the term "scale up" - see the classic paper in 
Comm. ACM by DeWitt and Gray on parallel database systems and their 
performance.  "Batch scale up" is when you increase both the data set 
size AND the system size in tandem - and you hope to keep performance 
flat.  This means 10x the iron will run 10x the problem size in the same 
time as 1x ran the 1x problem.  This is what it means to provide "perfect 
batch scale up".  There is also "transaction scale up" - this is when 
you increase the number of concurrent queries and the amount of iron in 
tandem - e.g., 10x the offered load and 10x the iron - and again, a 
"perfect" result is that the bigger system handles the bigger workload 
with the same performance as the 1x/1x case.

I suggest moving to the common "scale up" notion (and in this case, the 
common "batch scale up" notion).  So, as you increase problem size, also 
increase cluster size, and look for performance to stay flat as your goal.
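
To make the two metrics concrete, here is a minimal Python sketch of how
speed up and batch scale up could be computed from measured run times. The
function names and the timings are made up for illustration; they are not
VXQuery measurements.

```python
def speed_up(base_seconds: float, scaled_seconds: float) -> float:
    """Speed up: same data set, more nodes. Returns how many times faster."""
    return base_seconds / scaled_seconds


def speed_up_efficiency(base_seconds: float, scaled_seconds: float,
                        node_factor: float) -> float:
    """Fraction of perfect speed up achieved (1.0 means perfect)."""
    return speed_up(base_seconds, scaled_seconds) / node_factor


def batch_scale_up(base_seconds: float, scaled_seconds: float) -> float:
    """Batch scale up: data set and nodes grow by the same factor.

    1.0 means the run time stayed flat (perfect); below 1.0 means the
    bigger system took longer on the bigger problem.
    """
    return base_seconds / scaled_seconds


# Hypothetical timings in seconds, for illustration only.
t_1x_nodes_1x_data = 100.0
t_10x_nodes_1x_data = 12.5     # speed up experiment
t_10x_nodes_10x_data = 125.0   # batch scale up experiment

print(speed_up(t_1x_nodes_1x_data, t_10x_nodes_1x_data))                 # 8.0
print(speed_up_efficiency(t_1x_nodes_1x_data, t_10x_nodes_1x_data, 10))  # 0.8
print(batch_scale_up(t_1x_nodes_1x_data, t_10x_nodes_10x_data))          # 0.8
```

With these invented numbers, 10x the iron gives 8x the speed (80% of perfect
speed up), and the 10x/10x run takes 1.25x as long as the 1x/1x run.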

Cheers,
Mike

Re: XML Benchmark Test Scenarios and Queries

Posted by Michael Carey <mj...@ics.uci.edu>.

Nice!

On 11/21/13 11:22 AM, Eldon Carman wrote:
> Hi All,
>
> After a conversation with Till, I have updated the English queries. The new
> queries have specific details about which station, dates, etc. are used for
> the query. The list of queries has been reduced to only the ones that
> should be interesting for a user.
>
> Filtering Query
> Query: See historical data for the Riverside, CA (ASN00008113) station by
> selecting the weather readings for December 25 over the last 10 years.
> Query: Find all readings that meet the hurricane force wind warning or
> extreme wind warning criteria. The warnings occur when the wind speed
> exceeds 110 mph.
>
> Aggregation Query
> Query: Find the annual precipitation for Seattle using the airport
> station (USW00024233) for 1999.
> Query: Find the lowest/highest recorded temperature.
>
> Join Query
> Query: Find all the weather readings for Los Angeles County for a specific
> day, 1976/7/4.
>
> Join and Aggregation Query
> Query: Find the lowest/highest recorded temperature in the state of Oregon
> for 2001.

Re: XML Benchmark Test Scenarios and Queries

Posted by Eldon Carman <ec...@ucr.edu>.

Hi All,

After a conversation with Till, I have updated the English queries. The new
queries have specific details about which station, dates, etc. are used for
the query. The list of queries has been reduced to only the ones that
should be interesting for a user.

Filtering Query
Query: See historical data for the Riverside, CA (ASN00008113) station by
selecting the weather readings for December 25 over the last 10 years.
Query: Find all readings that meet the hurricane force wind warning or
extreme wind warning criteria. The warnings occur when the wind speed
exceeds 110 mph.

Aggregation Query
Query: Find the annual precipitation for Seattle using the airport
station (USW00024233) for 1999.
Query: Find the lowest/highest recorded temperature.

Join Query
Query: Find all the weather readings for Los Angeles County for a specific
day, 1976/7/4.

Join and Aggregation Query
Query: Find the lowest/highest recorded temperature in the state of Oregon
for 2001.
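
As a rough illustration of the first filtering query (one station, December
25, last 10 years), here is a small Python sketch over the sensor data
layout. It is not the XQuery that will go into the benchmark. The function
name, the "GHCND:" prefix on the Riverside station id, the choice of 2013 as
the last year, and all the readings are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Made-up readings in the sensor data layout; values are not real GHCN data.
SAMPLE = """<dataCollection pageCount="1" totalCount="4">
  <data><date>2012-12-25T00:00:00.000</date><dataType>TMAX</dataType>
    <station>GHCND:ASN00008113</station><value>211</value></data>
  <data><date>2005-12-25T00:00:00.000</date><dataType>TMAX</dataType>
    <station>GHCND:ASN00008113</station><value>189</value></data>
  <data><date>2003-12-25T00:00:00.000</date><dataType>TMAX</dataType>
    <station>GHCND:ASN00008113</station><value>200</value></data>
  <data><date>2012-12-25T00:00:00.000</date><dataType>TMAX</dataType>
    <station>GHCND:USW00024233</station><value>72</value></data>
</dataCollection>"""


def readings_on_day(xml_text, station_id, month, day, last_year, years=10):
    """Return (date, dataType, value) for one station on a given month/day
    within the `years` calendar years ending at `last_year` (inclusive)."""
    first_year = last_year - years + 1
    matches = []
    for data in ET.fromstring(xml_text).findall("data"):
        if data.findtext("station") != station_id:
            continue  # filter on station first
        when = datetime.strptime(data.findtext("date"), "%Y-%m-%dT%H:%M:%S.%f")
        in_window = first_year <= when.year <= last_year
        if (when.month, when.day) == (month, day) and in_window:
            matches.append((when.date().isoformat(),
                            data.findtext("dataType"),
                            int(data.findtext("value"))))
    return matches


result = readings_on_day(SAMPLE, "GHCND:ASN00008113", 12, 25, last_year=2013)
print(result)
# [('2012-12-25', 'TMAX', 211), ('2005-12-25', 'TMAX', 189)]
```

The 2003 reading falls outside the 2004 to 2013 window and the Seattle
reading belongs to another station, so both are dropped.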


On Tue, Nov 19, 2013 at 8:18 PM, Eldon Carman <ec...@ucr.edu> wrote:

> Thanks Mike. I will update my test plans.
>
> Vinayak, I realized later that I did not include the sample XML files.
>
> Weather Data Overview
> The weather data has been downloaded from NOAA's HTTP-available dat file
> and set up to mimic the XML web service offered on their website. The
> Global Historical Climatology Network (GHCN)-Daily data set includes
> summaries of climate recordings. The core data includes fields for high and
> low temperatures, snowfall, snow depth and rainfall. The full list of
> fields can be found on the NOAA site
> (http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). The readings
> include the date, data type, station id, value, and attributes about the
> reading. Details about each station can be downloaded through a separate
> web service query. Each station record has its name, latitude, longitude,
> dates of first and last reading, and various location names.
>
> Attached are two sample XML files: a single day's sensor readings and the
> station details.
>
> Sensor Data basic schema
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <dataCollection pageCount="1" totalCount="11">
>   <data>
>     <date>2013-09-22T00:00:00.000</date>
>     <dataType>AWND</dataType>
>     <station>GHCND:USW00003822</station>
>     <value>17</value>
>     <attributes>
>       <attribute></attribute> <!--  measurement flag -->
>       <attribute></attribute> <!-- quality flag -->
>       <attribute>W</attribute> <!-- source flag -->
>       <attribute></attribute> <!-- time of reading -->
>     </attributes>
>   </data>
>   <!-- repeat data tag -->
> </dataCollection>
>
> Station Data basic schema
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <stationCollection pageSize="100" pageCount="1" totalCount="1">
>   <station>
>     <id>GHCND:USW00003822</id>
>     <displayName>SAVANNAH INTERNATIONAL AIRPORT, GA US</displayName>
>     <minDate>1948-01-01</minDate>
>     <maxDate>2013-10-20</maxDate>
>     <latitude>32.13</latitude>
>     <longitude>-81.21</longitude>
>     <elevation>14</elevation>
>     <locationLabels>
>       <type>ZIP</type>
>       <id>ZIP:31408</id>
>       <displayName>Savannah, GA 31408</displayName>
>     </locationLabels>
>     <!-- repeat location labels -->
>     <coverage>1</coverage>
>   </station>
> </stationCollection>
>
>
> Filtering Test
> The filtering test returns only a subset of the data.
>
> Query: Select a single weather station reading.
> Query: See historical data by selecting the weather readings for today's
> date last year.
> Query: Find all severe wind readings.
>
>
> Aggregation Test
> The aggregation test compiles information into a single result. The
> aggregation queries focus on processing data to help with analysis.
>
> Query: Count the number of weather readings in the database.
> Query: Find the annual precipitation for a station.
> Query: Find the lowest/highest recorded temperature.
>
>
> Join Test
> The join test exercises nested loops over different data sets.
> The weather data set has both sensor readings and station details.
>
> Query: Find the station’s name and date of the first sensor reading.
> Query: Find regional weather readings for a specific day.
>
>
> Side Question: Should we have a test for creating modified data results?

Re: XML Benchmark Test Scenarios and Queries

Posted by Eldon Carman <ec...@ucr.edu>.

Thanks Mike. I will update my test plans.

Vinayak, I realized later that I did not include the sample XML files.

Weather Data Overview
The weather data has been downloaded from NOAA's HTTP-available dat file
and set up to mimic the XML web service offered on their website. The
Global Historical Climatology Network (GHCN)-Daily data set includes
summaries of climate recordings. The core data includes fields for high and
low temperatures, snowfall, snow depth and rainfall. The full list of
fields can be found on the NOAA site
(http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). The readings
include the date, data type, station id, value, and attributes about the
reading. Details about each station can be downloaded through a separate
web service query. Each station record has its name, latitude, longitude,
dates of first and last reading, and various location names.

Attached are two sample XML files: a single day's sensor readings and the
station details.

Sensor Data basic schema

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dataCollection pageCount="1" totalCount="11">
  <data>
    <date>2013-09-22T00:00:00.000</date>
    <dataType>AWND</dataType>
    <station>GHCND:USW00003822</station>
    <value>17</value>
    <attributes>
      <attribute></attribute> <!--  measurement flag -->
      <attribute></attribute> <!-- quality flag -->
      <attribute>W</attribute> <!-- source flag -->
      <attribute></attribute> <!-- time of reading -->
    </attributes>
  </data>
  <!-- repeat data tag -->
</dataCollection>
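
For anyone who wants to poke at the sensor layout outside of VXQuery, here
is a minimal Python sketch that loads the sample above into plain records.
The Reading class and parse_readings function are names invented for this
example, and the order of the four flags is taken from the comments in the
sample.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# The sensor sample from this message, minus the XML declaration.
SENSOR_XML = """<dataCollection pageCount="1" totalCount="11">
  <data>
    <date>2013-09-22T00:00:00.000</date>
    <dataType>AWND</dataType>
    <station>GHCND:USW00003822</station>
    <value>17</value>
    <attributes>
      <attribute></attribute> <!-- measurement flag -->
      <attribute></attribute> <!-- quality flag -->
      <attribute>W</attribute> <!-- source flag -->
      <attribute></attribute> <!-- time of reading -->
    </attributes>
  </data>
</dataCollection>"""


@dataclass
class Reading:
    date: str
    data_type: str
    station: str
    value: int
    flags: List[str]  # measurement, quality, source, time of reading


def parse_readings(xml_text: str) -> List[Reading]:
    """Turn every <data> element into a Reading; empty flags become ''."""
    readings = []
    for data in ET.fromstring(xml_text).findall("data"):
        flags = [a.text or "" for a in data.findall("attributes/attribute")]
        readings.append(Reading(
            date=data.findtext("date"),
            data_type=data.findtext("dataType"),
            station=data.findtext("station"),
            value=int(data.findtext("value")),
            flags=flags,
        ))
    return readings


readings = parse_readings(SENSOR_XML)
print(readings[0].data_type, readings[0].value, readings[0].flags)
# AWND 17 ['', '', 'W', '']
```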

Station Data basic schema

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<stationCollection pageSize="100" pageCount="1" totalCount="1">
  <station>
    <id>GHCND:USW00003822</id>
    <displayName>SAVANNAH INTERNATIONAL AIRPORT, GA US</displayName>
    <minDate>1948-01-01</minDate>
    <maxDate>2013-10-20</maxDate>
    <latitude>32.13</latitude>
    <longitude>-81.21</longitude>
    <elevation>14</elevation>
    <locationLabels>
      <type>ZIP</type>
      <id>ZIP:31408</id>
      <displayName>Savannah, GA 31408</displayName>
    </locationLabels>
    <!-- repeat location labels -->
    <coverage>1</coverage>
  </station>
</stationCollection>
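
The station layout can be read the same way. This is only a sketch: the
parse_stations function and the dictionary keys are invented for the
example, and it assumes every station carries all of the fields shown in the
sample above.

```python
import xml.etree.ElementTree as ET

# The station sample from this message, minus the XML declaration.
STATION_XML = """<stationCollection pageSize="100" pageCount="1" totalCount="1">
  <station>
    <id>GHCND:USW00003822</id>
    <displayName>SAVANNAH INTERNATIONAL AIRPORT, GA US</displayName>
    <minDate>1948-01-01</minDate>
    <maxDate>2013-10-20</maxDate>
    <latitude>32.13</latitude>
    <longitude>-81.21</longitude>
    <elevation>14</elevation>
    <locationLabels>
      <type>ZIP</type>
      <id>ZIP:31408</id>
      <displayName>Savannah, GA 31408</displayName>
    </locationLabels>
    <coverage>1</coverage>
  </station>
</stationCollection>"""


def parse_stations(xml_text):
    """Return one dict per <station>, with its location labels as tuples."""
    stations = []
    for st in ET.fromstring(xml_text).findall("station"):
        labels = [(lb.findtext("type"), lb.findtext("id"),
                   lb.findtext("displayName"))
                  for lb in st.findall("locationLabels")]
        stations.append({
            "id": st.findtext("id"),
            "name": st.findtext("displayName"),
            "first_reading": st.findtext("minDate"),
            "last_reading": st.findtext("maxDate"),
            "latitude": float(st.findtext("latitude")),
            "longitude": float(st.findtext("longitude")),
            "elevation": float(st.findtext("elevation")),
            "labels": labels,
        })
    return stations


stations = parse_stations(STATION_XML)
print(stations[0]["name"], stations[0]["labels"])
```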

Filtering Test
The filtering test returns only a subset of the data.

Query: Select a single weather station reading.
Query: See historical data by selecting the weather readings for today's
date last year.
Query: Find all severe wind readings.
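
To show the shape of a filtering query, here is a minimal Python sketch of
the severe wind case. The filter_readings function, the second station id
and the threshold of 500 are made up for the example; the threshold is in
whatever unit the <value> field uses, which this sketch does not convert.

```python
import xml.etree.ElementTree as ET

# Made-up readings in the sensor data layout; only the first one is taken
# from the sample in this message.
SAMPLE = """<dataCollection pageCount="1" totalCount="3">
  <data><date>2013-09-22T00:00:00.000</date><dataType>AWND</dataType>
    <station>GHCND:USW00003822</station><value>17</value></data>
  <data><date>2013-09-22T00:00:00.000</date><dataType>AWND</dataType>
    <station>GHCND:EXAMPLE00001</station><value>520</value></data>
  <data><date>2013-09-22T00:00:00.000</date><dataType>TMAX</dataType>
    <station>GHCND:USW00003822</station><value>600</value></data>
</dataCollection>"""


def filter_readings(xml_text, data_type, min_value):
    """Return (station, date, value) for readings of one data type whose
    value is strictly greater than min_value."""
    selected = []
    for data in ET.fromstring(xml_text).findall("data"):
        if data.findtext("dataType") != data_type:
            continue
        value = int(data.findtext("value"))
        if value > min_value:
            selected.append((data.findtext("station"),
                             data.findtext("date"), value))
    return selected


severe = filter_readings(SAMPLE, "AWND", min_value=500)
print(severe)
# [('GHCND:EXAMPLE00001', '2013-09-22T00:00:00.000', 520)]
```

The TMAX reading is above the threshold but is not a wind reading, so only
one row comes back.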

Aggregation Test
The aggregation test compiles information into a single result. The
aggregation queries focus on processing data to help with analysis.

Query: Count the number of weather readings in the database.
Query: Find the annual precipitation for a station.
Query: Find the lowest/highest recorded temperature.
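
The three aggregation queries could look roughly like this in Python. This
is a sketch over invented data: the function names, the EXAMPLE station id
and every value are assumptions, and the use of PRCP, TMIN and TMAX as data
type codes should be checked against the NOAA readme. No unit conversion is
done on <value>.

```python
import xml.etree.ElementTree as ET

# (date, dataType, value) rows for one hypothetical station.
ROWS = [
    ("1999-01-05", "PRCP", 10), ("1999-06-10", "PRCP", 25),
    ("2000-01-01", "PRCP", 7),
    ("1999-07-01", "TMAX", 300), ("1999-07-02", "TMAX", 280),
    ("1999-01-05", "TMIN", -50), ("1999-01-06", "TMIN", -20),
]
STATION = "GHCND:EXAMPLE00001"
SAMPLE = "<dataCollection>" + "".join(
    f"<data><date>{d}T00:00:00.000</date><dataType>{t}</dataType>"
    f"<station>{STATION}</station><value>{v}</value></data>"
    for d, t, v in ROWS) + "</dataCollection>"


def count_readings(root):
    """Count every <data> element."""
    return len(root.findall("data"))


def annual_precipitation(root, station_id, year):
    """Sum the PRCP values of one station for one calendar year."""
    return sum(int(d.findtext("value")) for d in root.findall("data")
               if d.findtext("dataType") == "PRCP"
               and d.findtext("station") == station_id
               and d.findtext("date").startswith(f"{year}-"))


def temperature_extremes(root):
    """Return (lowest TMIN value, highest TMAX value) over all readings."""
    lows = [int(d.findtext("value")) for d in root.findall("data")
            if d.findtext("dataType") == "TMIN"]
    highs = [int(d.findtext("value")) for d in root.findall("data")
             if d.findtext("dataType") == "TMAX"]
    return min(lows), max(highs)


root = ET.fromstring(SAMPLE)
print(count_readings(root))                       # 7
print(annual_precipitation(root, STATION, 1999))  # 35
print(temperature_extremes(root))                 # (-50, 300)
```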

Join Test
The join test exercises nested loops over different data sets.
The weather data set has both sensor readings and station details.

Query: Find the station’s name and date of the first sensor reading.
Query: Find regional weather readings for a specific day.
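
Here is a small Python sketch of the first join query written as a literal
nested loop over the two data sets. It only illustrates the shape of the
join and says nothing about how VXQuery would plan it. The function name,
the EXAMPLE station and the reading dates are made up.

```python
import xml.etree.ElementTree as ET

STATIONS = """<stationCollection pageSize="100" pageCount="1" totalCount="2">
  <station><id>GHCND:USW00003822</id>
    <displayName>SAVANNAH INTERNATIONAL AIRPORT, GA US</displayName></station>
  <station><id>GHCND:EXAMPLE00001</id>
    <displayName>EXAMPLE STATION</displayName></station>
</stationCollection>"""

READINGS = """<dataCollection pageCount="1" totalCount="3">
  <data><date>2013-09-22T00:00:00.000</date><dataType>AWND</dataType>
    <station>GHCND:USW00003822</station><value>17</value></data>
  <data><date>2013-09-21T00:00:00.000</date><dataType>AWND</dataType>
    <station>GHCND:USW00003822</station><value>21</value></data>
  <data><date>2013-09-22T00:00:00.000</date><dataType>AWND</dataType>
    <station>GHCND:EXAMPLE00001</station><value>9</value></data>
</dataCollection>"""


def first_reading_per_station(stations_xml, readings_xml):
    """Nested loop join on station id; returns (station name, first date).

    Dates are compared as strings, which is safe here because they all use
    the same fixed-width ISO layout.
    """
    stations = ET.fromstring(stations_xml).findall("station")
    readings = ET.fromstring(readings_xml).findall("data")
    result = []
    for station in stations:                # outer loop: station details
        station_id = station.findtext("id")
        first = None
        for reading in readings:            # inner loop: sensor readings
            if reading.findtext("station") != station_id:
                continue                    # join predicate not met
            date = reading.findtext("date")
            if first is None or date < first:
                first = date
        if first is not None:
            result.append((station.findtext("displayName"), first))
    return result


joined = first_reading_per_station(STATIONS, READINGS)
for name, date in joined:
    print(name, date)
```

Stations with no matching readings are left out, which makes this an inner
join; whether the benchmark wants that behaviour is an open choice.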

Side Question: Should we have a test for creating modified data results?

On Tue, Nov 19, 2013 at 12:11 AM, Vinayak Borkar <vi...@gmail.com> wrote:
> Preston,
>
> With respect to the benchmark queries, let me suggest the following
> approach. Send out an email with:
>   - GHCN information content (schema with some English description of
>     each field).
>   - A list of questions in English that represent the interesting
>     questions to ask of that data.
>
> This will give everyone the necessary background to possibly suggest
> modifications and other interesting queries to ask against the data.
>
> Finally, once there is consensus regarding the queries, we can translate
> the English version into XQuery.
>
> Thanks,
> Vinayak

Re: XML Benchmark Test Scenarios and Queries

Posted by Vinayak Borkar <vi...@gmail.com>.

Preston,


With respect to the benchmark queries, let me suggest the following 
approach. Send out an email with:
   - GHCN information content (schema with some English description of 
each field).
   - A list of questions in English that represent the interesting 
questions to ask of that data.

This will give everyone the necessary background to possibly suggest 
modifications and other interesting queries to ask against the data.

Finally, once there is consensus regarding the queries, we can translate 
the English version into XQuery.

Thanks,
Vinayak

