Posted to dev@vxquery.apache.org by Eldon Carman <ec...@ucr.edu> on 2015/09/11 00:05:11 UTC

Re: Amazon (Scale) Test

The XMark benchmark includes a data generator [1]. The generator produces one
large XML file. To run data generation in parallel on a large cluster, I think
we should make two changes to the script.

1. Split data types into separate folders and files

The script should generate a new file for each type of data (people, open
auctions, etc.). It already supports two parameters that determine the data
files: the data set size and the max individual file size. If the data set
size is larger than the max file size, the script creates as many files as
needed, each capped at the max file size, to hold the full data set. (I think
I made that sound complicated; it's really not.) Our change would simply
start a new file (and folder) for each data type; a sketch of the splitting
logic follows the examples below.

Dataset size: 2G
File size: 1G

Example (current)
xmark.1.xml (1G)
xmark.2.xml (1G)

Example (suggested)
people/people.1.xml (100M)
open_auctions/open_auctions.1.xml (1G)
open_auctions/open_auctions.2.xml (500M)
etc.
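
To make the splitting concrete, here is a rough Python sketch (mine, not part
of the XMark distribution). The record tags follow XMark's schema, but the
folder layout, the 1G cap, and every name in it are just suggestions:

import os
import xml.etree.ElementTree as ET

MAX_FILE_BYTES = 1024 ** 3  # suggested 1G cap per individual file

# XMark record tags mapped to the folder (and wrapper element) for each type.
CONTAINERS = {
    "person": "people",
    "open_auction": "open_auctions",
    "closed_auction": "closed_auctions",
    "item": "items",
    "category": "categories",
}

def split_by_type(xmark_path, out_dir):
    writers = {}  # folder -> (file handle, file index, bytes written)

    def write_record(folder, payload):
        handle, index, written = writers.get(folder, (None, 0, 0))
        # Roll over to a new numbered file once the cap would be exceeded.
        if handle is None or written + len(payload) > MAX_FILE_BYTES:
            if handle is not None:
                handle.write(("</%s>" % folder).encode())
                handle.close()
            index += 1
            os.makedirs(os.path.join(out_dir, folder), exist_ok=True)
            path = os.path.join(out_dir, folder, "%s.%d.xml" % (folder, index))
            handle = open(path, "wb")
            handle.write(("<%s>" % folder).encode())
            written = 0
        handle.write(payload)
        writers[folder] = (handle, index, written + len(payload))

    # Stream the input so a multi-gigabyte file never sits in memory at once.
    for _, elem in ET.iterparse(xmark_path, events=("end",)):
        folder = CONTAINERS.get(elem.tag)
        if folder is not None:
            write_record(folder, ET.tostring(elem))
            elem.clear()  # drop the record's subtree once it is written out

    for folder, (handle, _, _) in writers.items():
        handle.write(("</%s>" % folder).encode())
        handle.close()

In practice I would rather patch the generator itself to write each type to
its own stream as it goes, instead of post-processing one big file, but the
rollover rule would be the same.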

2. Simultaneous data generation in a cluster

The script must create unique ids on each node. Consider the people type:
each person is given an incremental id. When data generation is spread across
the cluster, each person still needs a unique id. One option would be to
assign an id range to each node; the script could use this range to create
people ids that are unique across the whole cluster.
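
For example, something like this hypothetical helper (the names and
parameters are mine, not part of the generator) would hand every node a
disjoint block of people ids:

def id_range(node_index, n_nodes, total_people):
    # Ceiling division so the blocks cover every id even when the
    # total does not divide evenly across the nodes.
    per_node = -(-total_people // n_nodes)
    start = node_index * per_node
    end = min(start + per_node, total_people)
    return range(start, end)

# Example: 4 nodes generating 10 million people in total.
for node in range(4):
    r = id_range(node, 4, 10_000_000)
    print("node %d: person%d .. person%d" % (node, r.start, r.stop - 1))

Each node would then seed its local generator run with its start id and stop
once its block is exhausted.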

Does this make sense?

Mahalo,
Preston

[1] http://www.xml-benchmark.org/downloads.html

On Wed, Aug 12, 2015 at 2:49 AM, Efi <ef...@gmail.com> wrote:

> On this note, I ran the HDFS tests on a cluster with 3 nodes and everything
> went fine. I would like to help with this one as much as I can.
>
> I have submitted a pull request [1] with instructions on how to run the
> HDFS queries.
>
> I also have some Ansible scripts that I use to quickly set up Hadoop
> clusters; they could help with that installation.
>
> Best regards,
> Efi
>
> [1] https://github.com/apache/vxquery/pull/24
>
>
> On 11/08/2015 10:58 PM, Michael Carey wrote:
>
>> PS - https://github.com/TU-Berlin-DIMA/myriad-toolkit/wiki
>>
>> On 8/11/15 12:25 PM, Eldon Carman wrote:
>>
>>> Hi Guys,
>>>
>>> We have an opportunity to test VXQuery on AWS. I wanted to get feedback on
>>> the tasks needed to prepare a scale test on AWS. What do we need to create
>>> in AWS, and what coding tasks do we need to complete first?
>>>
>>> I think the test would be a great opportunity to test out the Yarn and
>>> HDFS code (from GSOC). Also, now that we have a handful of XMark queries
>>> working (also from GSOC), they could be used for the test. XMark includes
>>> an XML generator for a single machine.
>>>
>>> What types of scale tests would be good in this environment? Typically we
>>> use Scale-Up and Speed-Up tests.
>>>
>>> What AWS architecture would the test require? I am new to AWS, so please
>>> post your suggestions. One requirement for our test will be a data size
>>> that exceeds local memory by five times.
>>>
>>> Testing Requirements (suggested):
>>> - AWS Architecture
>>> - XMark Benchmark
>>> - HDFS as data storage
>>> - Data size 5 times local memory for each node, at least for the largest
>>> scale-up test (see the sizing example below this list)
>>> - Scale-Up test (how big can we go?)
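>>>
>>> As a sizing example (hypothetical numbers: say m4.xlarge nodes with 16G of
>>> memory each), five times local memory means at least 80G of XMark data per
>>> node, so an eight-node scale-up test would need roughly 640G of generated
>>> data.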
>>>
>>> Here is a list of tasks that I can think of right now...
>>>
>>> Coding
>>>   - Finish GSOC projects
>>>   - Update the XMark XML generator to work in a cluster environment.
>>> (Create local node data in parallel that is unique across the cluster.)
>>>   - Benchmark scripts for the XMark query tests.
>>>
>>> AWS
>>>   - Determine architecture for the test.
>>>   - Scripts/configuration for cluster build out.
>>>
>>> Previous tests used only eight local server nodes. The AWS test will
>>> exercise Apache VXQuery in a cloud environment and could scale to a much
>>> larger cluster.
>>>
>>> Thanks for your feedback.
>>> Preston
>>>
>>>
>>
>>
>

Re: Amazon (Scale) Test

Posted by Michael Carey <mj...@ics.uci.edu>.
Indeed, that makes a lot of sense... (more so than the original
one-giant-file setup).
