Posted to dev@vxquery.apache.org by Eldon Carman <ec...@ucr.edu> on 2013/11/17 21:07:14 UTC

XML Datasets for Benchmarking

VXQuery was created to work on a large set of small XML files. Our
goal for benchmarking is to find a dataset whose XML files do not need to
be modified. The only task would then be to download the dataset and
distribute it onto a cluster.

The initial benchmark test will focus on data from NOAA's National Climatic
Data Center (NCDC), which provides Global Historical Climatology Network
data. The dataset includes various daily weather sensor readings from
~90,000 stations across the globe. The amount of data per station varies
with how long the station has been active. The oldest stations have data
from the 1890s.

NCDC offers two methods of accessing the information: dat files and a web
service (XML and JSON). I created a script that downloads the dat files and
uses them to generate XML files equivalent to what the web service
returns. The web service query covers a month's data for a single station.
The script allows all the historical data to be downloaded once and then
processed locally. The station information is offered in a separate web
service query and contains more information than is available in the dat
files. Thus, I have a second script to download all stations separately.
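
As a rough illustration of the conversion step described above, here is a
minimal Python sketch. It is not the actual script. It assumes each dat
file line is a fixed-width record holding a station id, year, month, and
element code, followed by 31 daily slots of a 5-character value plus 3 flag
characters, with -9999 marking a missing day. The XML element names
(dataCollection, data, date, dataType, station, value) and the function
names are placeholders chosen for the example, not the web service's real
schema.

```python
import xml.etree.ElementTree as ET

MISSING = -9999  # assumed sentinel for "no reading on this day"


def parse_record(line):
    """Split one fixed-width record into its fields (assumed layout)."""
    station = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]
    values = []
    for day in range(31):
        start = 21 + day * 8          # 5 value chars + 3 flag chars per day
        field = line[start:start + 5]
        if not field.strip():
            continue                  # tolerate short or blank-padded lines
        value = int(field)
        if value != MISSING:
            values.append((day + 1, value))
    return station, year, month, element, values


def record_to_xml(line):
    """Render one month of one element for one station as an XML string."""
    station, year, month, element, values = parse_record(line)
    root = ET.Element("dataCollection")   # hypothetical element names
    for day, value in values:
        data = ET.SubElement(root, "data")
        ET.SubElement(data, "date").text = f"{year:04d}-{month:02d}-{day:02d}"
        ET.SubElement(data, "dataType").text = element
        ET.SubElement(data, "station").text = station
        ET.SubElement(data, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Calling record_to_xml on each line of a station's dat file and writing the
results out per station and month would give one small XML file per web
service query, which matches the many-small-files shape VXQuery targets.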

I have also started downloading OpenStreetMap data to evaluate it as
another benchmark source.

Side question: Do you have any ideas for other sources that fit our
requirements?