You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by James Heather <ja...@mendeley.com> on 2015/09/09 12:55:17 UTC

How to add lots of dummy data to a table efficiently

I've been trying to create a table and then bung lots of random data 
into it, but without generating the data client-side and then upserting 
it. I'd rather find a way to have it generated on the server.

My table has three columns, all BIGINT, and the first is the primary key.

I'd ideally like to write something like

     UPSERT into linktab (id, first_val, second_val) SELECT next value 
for linktab_next_id, random_func, random_func ...

and complete the query so that the SELECT generates its values without 
needing to read from a table. But there doesn't seem to be a way of 
writing a SELECT that doesn't need a table parameter.

So I've ended up doing this. First I upsert a single row into the table; 
then, I double the number of rows in it by a self-referential UPSERT SELECT:

0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
8 rows affected (0.193 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
16 rows affected (0.202 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
32 rows affected (0.205 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
64 rows affected (0.235 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
128 rows affected (0.28 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
256 rows affected (0.378 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
512 rows affected (0.53 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
1,024 rows affected (0.848 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
2,048 rows affected (1.513 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
4,096 rows affected (2.817 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
8,192 rows affected (5.314 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
16,384 rows affected (10.807 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
32,768 rows affected (21.306 seconds)
0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
65,536 rows affected (45.713 seconds)
0: jdbc:phoenix:172.31.31.143> explain upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
+------------------------------------------+
|                   PLAN                   |
+------------------------------------------+
| UPSERT SELECT                            |
| CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER LINKTAB |
|     SERVER FILTER BY FIRST KEY ONLY      |
| CLIENT RESERVE VALUES FROM 1 SEQUENCE    |
+------------------------------------------+
4 rows selected (0.192 seconds)
0: jdbc:phoenix:172.31.31.143>


This works, but as you can see from the timings, it's getting quite 
slow, and this seems to be because it has to do a full scan of the table 
to retrieve the existing rows in order to construct the dummy data to 
put in. This seems a bit of a shame when it doesn't actually need the data!

Is there a way round this?

Could we have a SELECT ... LIMIT n command that doesn't require a table 
to be specified as long as none of the columns comes from a table?

James

Re: How to add lots of dummy data to a table efficiently

Posted by Cody Marcel <cm...@salesforce.com>.
Have you looked at the pherf module in Phoenix? You can predefined
rules and have it generate data based on that.

Sent from my iPhone

> On Sep 9, 2015, at 3:55 AM, James Heather <ja...@mendeley.com> wrote:
>
> I've been trying to create a table and then bung lots of random data into it, but without generating the data client-side and then upserting it. I'd rather find a way to have it generated on the server.
>
> My table has three columns, all BIGINT, and the first is the primary key.
>
> I'd ideally like to write something like
>
>    UPSERT into linktab (id, first_val, second_val) SELECT next value for linktab_next_id, random_func, random_func ...
>
> and complete the query so that the SELECT generates its values without needing to read from a table. But there doesn't seem to be a way of writing a SELECT that doesn't need a table parameter.
>
> So I've ended up doing this. First I upsert a single row into the table; then, I double the number of rows in it by a self-referential UPSERT SELECT:
>
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 8 rows affected (0.193 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 16 rows affected (0.202 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 32 rows affected (0.205 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 64 rows affected (0.235 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 128 rows affected (0.28 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 256 rows affected (0.378 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 512 rows affected (0.53 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 1,024 rows affected (0.848 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 2,048 rows affected (1.513 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 4,096 rows affected (2.817 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 8,192 rows affected (5.314 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 16,384 rows affected (10.807 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 32,768 rows affected (21.306 seconds)
> 0: jdbc:phoenix:172.31.31.143> upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> 65,536 rows affected (45.713 seconds)
> 0: jdbc:phoenix:172.31.31.143> explain upsert into linktab (id, first_val, second_val) select next value for linktab_next_id, round(rand()*1000000000000), round(rand()*1000000000000) from linktab;
> +------------------------------------------+
> |                   PLAN                   |
> +------------------------------------------+
> | UPSERT SELECT                            |
> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER LINKTAB |
> |     SERVER FILTER BY FIRST KEY ONLY      |
> | CLIENT RESERVE VALUES FROM 1 SEQUENCE    |
> +------------------------------------------+
> 4 rows selected (0.192 seconds)
> 0: jdbc:phoenix:172.31.31.143>
>
>
> This works, but as you can see from the timings, it's getting quite slow, and this seems to be because it has to do a full scan of the table to retrieve the existing rows in order to construct the dummy data to put in. This seems a bit of a shame when it doesn't actually need the data!
>
> Is there a way round this?
>
> Could we have a SELECT ... LIMIT n command that doesn't require a table to be specified as long as none of the columns comes from a table?
>
> James