You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Olivier Chedru <oC...@infovista.com> on 2007/07/16 09:39:01 UTC

Horizontal partitioning?

Hi all,

 

I am trying to achieve horizontal partitioning with Derby.

 

My application stores *a lot* of data as time passes, and it needs to
drop obsolete data.

 

Table partitioning would be the ideal solution, but it is not
implemented in Derby. So I create one table per period of time. All
tables have the same columns, the first one being the data timestamp.
Dropping obsolete data is then made easy: just drop old tables.

 

Now, I need to query data without knowing how tables are organized. So,
I create a view on all tables (the view is deleted/created every time a
data table is created/dropped): fine.

However, checking the runtime statistics, I notice ALL tables are
scanned when I use the view! Note I added constraints on the Timestamp
field.

 

Sample SQL:

 

create table t1 (ts timestamp, a int, b int, c int);

create table t2 (ts timestamp, a int, b int, c int);

create table t3 (ts timestamp, a int, b int, c int);

alter table t1 add constraint t1_c check (ts >= '2007-07-16 09:00:00.0'
and ts < '2007-07-16 09:01:00.0');

alter table t2 add constraint t2_c check (ts >= '2007-07-16 09:01:00.0'
and ts < '2007-07-16 09:02:00.0');

alter table t3 add constraint t3_c check (ts >= '2007-07-16 09:02:00.0'
and ts < '2007-07-16 09:03:00.0');

create view t_view as select * from t1 union select * from t2 union
select * from t3;

CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);

CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);

select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
'2007-07-16 09:02:00.0';

VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();

 

The displayed execution plan shows t1, t2 and t3 are scanned. I expect
only t2 is used. I also tried to force using constraints by defining the
view this way, but it did not improve the behavior:

 

create view t_view as select * from t1 --DERBY-PROPERTIES
constraint=t1_c union select * from t2 --DERBY-PROPERTIES
constraint=t2_c union select * from t3 --DERBY-PROPERTIES
constraint=t3_c;

 

Any idea on how to achieve horizontal partitioning?

 

Thanks!

Olivier.

RE: Horizontal partitioning?

Posted by Olivier Chedru <oC...@infovista.com>.

Hi David,

This looks like a very elegant solution and I will give it a try!

Thanks for the tip!
Olivier.

-----Original Message-----
From: david.vancouvering@gmail.com [mailto:david.vancouvering@gmail.com]
On Behalf Of David Van Couvering
Sent: Wednesday, July 18, 2007 8:46 PM
To: Derby Discussion
Subject: Re: Horizontal partitioning?

Hi, Olivier, thanks for your response,  I understand your situation
now.  The use of the term "horizontal partitioning" triggered a
standard interpretation for me.

One thing you could do is write a Java procedure, see

http://wiki.apache.org/db-derby/DerbySQLroutines

Using this approach, you could take as parameters the begin date, the
end date, and then a string for the actual query (without the
timestamp constraints).  You could then compose a query against the
relevant tables, given the timestamp constraints, and execute this
query, which is returned as a result set to the user.

David

On 7/18/07, Olivier Chedru <oC...@infovista.com> wrote:
> Hi David,
>
> The purpose of my application is to keep measurement data locally, and
> Derby is perfect to manage a local database. Because of deployment
> reasons, I cannot use a distributed database.
> The measurement data never change (they are just dropped when their
> lifetime is reached). This application aims at running 24/24, and I
need
> to keep the database size stable once its lifetime is reached.
Database
> users will want to query data over a given short period of time, with
> reasonable response times.
>
> If I use a single table, I will fall in the following issues:
> - Index maintenance will become more and more expensive as the table
> grows. And it will grow a lot: typically, I will store 10000 rows per
> second and keep 1 week of data. This would generate more than 6
billion
> rows!
> - Dropping obsolete data will mean using DELETE FROM, which is much
more
> expensive than dropping a table. And this will probably lead to
database
> fragmentation over time.
>
> So I try to proceed this way:
> 1 - When some criteria is met (period of time, number of rows...), I
> create a new table without indexes.
> 2 - Data are stored using batch insert statements.
> 3 - Once the table is considered full (using the same criteria as
> above), I create the necessary indexes on the table.
> 4 - Goto 1.
>
> To provide an easy access to data and hide all those tables, a view
> seems very appropriate. Unfortunately, as shown in my first post, the
> view does not take into account constraints on the timestamp columns,
> and scans all tables. No need to say the query never ends...
>
> Any idea about this problem is welcome!
>
> Thanks,
> Olivier.
>
> -----Original Message-----
> From: david.vancouvering@gmail.com
[mailto:david.vancouvering@gmail.com]
> On Behalf Of David Van Couvering
> Sent: Wednesday, July 18, 2007 7:09 AM
> To: Derby Discussion
> Subject: Re: Horizontal partitioning?
>
> Hm, if you're creating a view on all the tables, that must mean all
> the tables are under a single database?  If that's the case, then this
> doesn't really sound like horizontal partitioning (which usually means
> splitting it into multiple databases, each with its own disk and
> usually each with its own CPU).
>
> If you're keeping everything within one database, then partitioning
> across multiple tables really is not useful or necessary, as Derby
> itself accomplishes quick access to any row in the table using the
> index on the primary key.
>
> Normally, if you have N partitions, then you would have N databases on
> N machines.  There would be no way to create a view across all of
> them.
>
> This is how folks like eBay and Google do it, for example.  This
> usually only works if the data in each partition is completely or
> almost completely independent and you don't need to do queries that
> span tables (e.g. you can issue the query against one specific
> partition depending on things like timestamp, user id, geographic
> location, etc).  If your queries can't be isolated to a single
> partition, then your application code is going to have to deal with
> logic normally assigned to the database, such as sorting and merging
> data, and now you're in the realm of a clustered database, and having
> worked on two of these in the past, I think I can safely say you don't
> want to go there...
>
> David
>
> On 7/16/07, Olivier Chedru <oC...@infovista.com> wrote:
> >
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I am trying to achieve horizontal partitioning with Derby.
> >
> >
> >
> > My application stores *a lot* of data as time passes, and it needs
to
> drop
> > obsolete data.
> >
> >
> >
> > Table partitioning would be the ideal solution, but it is not
> implemented in
> > Derby. So I create one table per period of time. All tables have the
> same
> > columns, the first one being the data timestamp. Dropping obsolete
> data is
> > then made easy: just drop old tables.
> >
> >
> >
> > Now, I need to query data without knowing how tables are organized.
> So, I
> > create a view on all tables (the view is deleted/created every time
a
> data
> > table is created/dropped): fine.
> >
> > However, checking the runtime statistics, I notice ALL tables are
> scanned
> > when I use the view! Note I added constraints on the Timestamp
field.
> >
> >
> >
> > Sample SQL:
> >
> >
> >
> > create table t1 (ts timestamp, a int, b int, c int);
> >
> > create table t2 (ts timestamp, a int, b int, c int);
> >
> > create table t3 (ts timestamp, a int, b int, c int);
> >
> > alter table t1 add constraint t1_c check (ts >= '2007-07-16
> 09:00:00.0' and
> > ts < '2007-07-16 09:01:00.0');
> >
> > alter table t2 add constraint t2_c check (ts >= '2007-07-16
> 09:01:00.0' and
> > ts < '2007-07-16 09:02:00.0');
> >
> > alter table t3 add constraint t3_c check (ts >= '2007-07-16
> 09:02:00.0' and
> > ts < '2007-07-16 09:03:00.0');
> >
> > create view t_view as select * from t1 union select * from t2 union
> select *
> > from t3;
> >
> > CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> >
> > CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
> >
> > select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> > '2007-07-16 09:02:00.0';
> >
> > VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
> >
> >
> >
> > The displayed execution plan shows t1, t2 and t3 are scanned. I
expect
> only
> > t2 is used. I also tried to force using constraints by defining the
> view
> > this way, but it did not improve the behavior:
> >
> >
> >
> > create view t_view as select * from t1 --DERBY-PROPERTIES
> constraint=t1_c
> > union select * from t2 --DERBY-PROPERTIES constraint=t2_c union
select
> *
> > from t3 --DERBY-PROPERTIES constraint=t3_c;
> >
> >
> >
> > Any idea on how to achieve horizontal partitioning?
> >
> >
> >
> > Thanks!
> >
> > Olivier.
>

Re: Horizontal partitioning?

Posted by David Van Couvering <da...@vancouvering.com>.

Hi, Olivier, thanks for your response,  I understand your situation
now.  The use of the term "horizontal partitioning" triggered a
standard interpretation for me.

One thing you could do is write a Java procedure, see

http://wiki.apache.org/db-derby/DerbySQLroutines

Using this approach, you could take as parameters the begin date, the
end date, and then a string for the actual query (without the
timestamp constraints).  You could then compose a query against the
relevant tables, given the timestamp constraints, and execute this
query, which is returned as a result set to the user.

David

On 7/18/07, Olivier Chedru <oC...@infovista.com> wrote:
> Hi David,
>
> The purpose of my application is to keep measurement data locally, and
> Derby is perfect to manage a local database. Because of deployment
> reasons, I cannot use a distributed database.
> The measurement data never change (they are just dropped when their
> lifetime is reached). This application aims at running 24/24, and I need
> to keep the database size stable once its lifetime is reached. Database
> users will want to query data over a given short period of time, with
> reasonable response times.
>
> If I use a single table, I will fall in the following issues:
> - Index maintenance will become more and more expensive as the table
> grows. And it will grow a lot: typically, I will store 10000 rows per
> second and keep 1 week of data. This would generate more than 6 billion
> rows!
> - Dropping obsolete data will mean using DELETE FROM, which is much more
> expensive than dropping a table. And this will probably lead to database
> fragmentation over time.
>
> So I try to proceed this way:
> 1 - When some criteria is met (period of time, number of rows...), I
> create a new table without indexes.
> 2 - Data are stored using batch insert statements.
> 3 - Once the table is considered full (using the same criteria as
> above), I create the necessary indexes on the table.
> 4 - Goto 1.
>
> To provide an easy access to data and hide all those tables, a view
> seems very appropriate. Unfortunately, as shown in my first post, the
> view does not take into account constraints on the timestamp columns,
> and scans all tables. No need to say the query never ends...
>
> Any idea about this problem is welcome!
>
> Thanks,
> Olivier.
>
> -----Original Message-----
> From: david.vancouvering@gmail.com [mailto:david.vancouvering@gmail.com]
> On Behalf Of David Van Couvering
> Sent: Wednesday, July 18, 2007 7:09 AM
> To: Derby Discussion
> Subject: Re: Horizontal partitioning?
>
> Hm, if you're creating a view on all the tables, that must mean all
> the tables are under a single database?  If that's the case, then this
> doesn't really sound like horizontal partitioning (which usually means
> splitting it into multiple databases, each with its own disk and
> usually each with its own CPU).
>
> If you're keeping everything within one database, then partitioning
> across multiple tables really is not useful or necessary, as Derby
> itself accomplishes quick access to any row in the table using the
> index on the primary key.
>
> Normally, if you have N partitions, then you would have N databases on
> N machines.  There would be no way to create a view across all of
> them.
>
> This is how folks like eBay and Google do it, for example.  This
> usually only works if the data in each partition is completely or
> almost completely independent and you don't need to do queries that
> span tables (e.g. you can issue the query against one specific
> partition depending on things like timestamp, user id, geographic
> location, etc).  If your queries can't be isolated to a single
> partition, then your application code is going to have to deal with
> logic normally assigned to the database, such as sorting and merging
> data, and now you're in the realm of a clustered database, and having
> worked on two of these in the past, I think I can safely say you don't
> want to go there...
>
> David
>
> On 7/16/07, Olivier Chedru <oC...@infovista.com> wrote:
> >
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I am trying to achieve horizontal partitioning with Derby.
> >
> >
> >
> > My application stores *a lot* of data as time passes, and it needs to
> drop
> > obsolete data.
> >
> >
> >
> > Table partitioning would be the ideal solution, but it is not
> implemented in
> > Derby. So I create one table per period of time. All tables have the
> same
> > columns, the first one being the data timestamp. Dropping obsolete
> data is
> > then made easy: just drop old tables.
> >
> >
> >
> > Now, I need to query data without knowing how tables are organized.
> So, I
> > create a view on all tables (the view is deleted/created every time a
> data
> > table is created/dropped): fine.
> >
> > However, checking the runtime statistics, I notice ALL tables are
> scanned
> > when I use the view! Note I added constraints on the Timestamp field.
> >
> >
> >
> > Sample SQL:
> >
> >
> >
> > create table t1 (ts timestamp, a int, b int, c int);
> >
> > create table t2 (ts timestamp, a int, b int, c int);
> >
> > create table t3 (ts timestamp, a int, b int, c int);
> >
> > alter table t1 add constraint t1_c check (ts >= '2007-07-16
> 09:00:00.0' and
> > ts < '2007-07-16 09:01:00.0');
> >
> > alter table t2 add constraint t2_c check (ts >= '2007-07-16
> 09:01:00.0' and
> > ts < '2007-07-16 09:02:00.0');
> >
> > alter table t3 add constraint t3_c check (ts >= '2007-07-16
> 09:02:00.0' and
> > ts < '2007-07-16 09:03:00.0');
> >
> > create view t_view as select * from t1 union select * from t2 union
> select *
> > from t3;
> >
> > CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> >
> > CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
> >
> > select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> > '2007-07-16 09:02:00.0';
> >
> > VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
> >
> >
> >
> > The displayed execution plan shows t1, t2 and t3 are scanned. I expect
> only
> > t2 is used. I also tried to force using constraints by defining the
> view
> > this way, but it did not improve the behavior:
> >
> >
> >
> > create view t_view as select * from t1 --DERBY-PROPERTIES
> constraint=t1_c
> > union select * from t2 --DERBY-PROPERTIES constraint=t2_c union select
> *
> > from t3 --DERBY-PROPERTIES constraint=t3_c;
> >
> >
> >
> > Any idea on how to achieve horizontal partitioning?
> >
> >
> >
> > Thanks!
> >
> > Olivier.
>

RE: Horizontal partitioning?

Posted by Olivier Chedru <oC...@infovista.com>.

Hi David,

The purpose of my application is to keep measurement data locally, and
Derby is perfect to manage a local database. Because of deployment
reasons, I cannot use a distributed database.
The measurement data never change (they are just dropped when their
lifetime is reached). This application aims at running 24/24, and I need
to keep the database size stable once its lifetime is reached. Database
users will want to query data over a given short period of time, with
reasonable response times. 

If I use a single table, I will fall in the following issues:
- Index maintenance will become more and more expensive as the table
grows. And it will grow a lot: typically, I will store 10000 rows per
second and keep 1 week of data. This would generate more than 6 billion
rows!
- Dropping obsolete data will mean using DELETE FROM, which is much more
expensive than dropping a table. And this will probably lead to database
fragmentation over time.

So I try to proceed this way:
1 - When some criteria is met (period of time, number of rows...), I
create a new table without indexes.
2 - Data are stored using batch insert statements.
3 - Once the table is considered full (using the same criteria as
above), I create the necessary indexes on the table.
4 - Goto 1.

To provide an easy access to data and hide all those tables, a view
seems very appropriate. Unfortunately, as shown in my first post, the
view does not take into account constraints on the timestamp columns,
and scans all tables. No need to say the query never ends...

Any idea about this problem is welcome!

Thanks,
Olivier.

-----Original Message-----
From: david.vancouvering@gmail.com [mailto:david.vancouvering@gmail.com]
On Behalf Of David Van Couvering
Sent: Wednesday, July 18, 2007 7:09 AM
To: Derby Discussion
Subject: Re: Horizontal partitioning?

Hm, if you're creating a view on all the tables, that must mean all
the tables are under a single database?  If that's the case, then this
doesn't really sound like horizontal partitioning (which usually means
splitting it into multiple databases, each with its own disk and
usually each with its own CPU).

If you're keeping everything within one database, then partitioning
across multiple tables really is not useful or necessary, as Derby
itself accomplishes quick access to any row in the table using the
index on the primary key.

Normally, if you have N partitions, then you would have N databases on
N machines.  There would be no way to create a view across all of
them.

This is how folks like eBay and Google do it, for example.  This
usually only works if the data in each partition is completely or
almost completely independent and you don't need to do queries that
span tables (e.g. you can issue the query against one specific
partition depending on things like timestamp, user id, geographic
location, etc).  If your queries can't be isolated to a single
partition, then your application code is going to have to deal with
logic normally assigned to the database, such as sorting and merging
data, and now you're in the realm of a clustered database, and having
worked on two of these in the past, I think I can safely say you don't
want to go there...

David

On 7/16/07, Olivier Chedru <oC...@infovista.com> wrote:
>
>
>
>
> Hi all,
>
>
>
> I am trying to achieve horizontal partitioning with Derby.
>
>
>
> My application stores *a lot* of data as time passes, and it needs to
drop
> obsolete data.
>
>
>
> Table partitioning would be the ideal solution, but it is not
implemented in
> Derby. So I create one table per period of time. All tables have the
same
> columns, the first one being the data timestamp. Dropping obsolete
data is
> then made easy: just drop old tables.
>
>
>
> Now, I need to query data without knowing how tables are organized.
So, I
> create a view on all tables (the view is deleted/created every time a
data
> table is created/dropped): fine.
>
> However, checking the runtime statistics, I notice ALL tables are
scanned
> when I use the view! Note I added constraints on the Timestamp field.
>
>
>
> Sample SQL:
>
>
>
> create table t1 (ts timestamp, a int, b int, c int);
>
> create table t2 (ts timestamp, a int, b int, c int);
>
> create table t3 (ts timestamp, a int, b int, c int);
>
> alter table t1 add constraint t1_c check (ts >= '2007-07-16
09:00:00.0' and
> ts < '2007-07-16 09:01:00.0');
>
> alter table t2 add constraint t2_c check (ts >= '2007-07-16
09:01:00.0' and
> ts < '2007-07-16 09:02:00.0');
>
> alter table t3 add constraint t3_c check (ts >= '2007-07-16
09:02:00.0' and
> ts < '2007-07-16 09:03:00.0');
>
> create view t_view as select * from t1 union select * from t2 union
select *
> from t3;
>
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
>
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
>
> select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> '2007-07-16 09:02:00.0';
>
> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
>
>
>
> The displayed execution plan shows t1, t2 and t3 are scanned. I expect
only
> t2 is used. I also tried to force using constraints by defining the
view
> this way, but it did not improve the behavior:
>
>
>
> create view t_view as select * from t1 --DERBY-PROPERTIES
constraint=t1_c
> union select * from t2 --DERBY-PROPERTIES constraint=t2_c union select
*
> from t3 --DERBY-PROPERTIES constraint=t3_c;
>
>
>
> Any idea on how to achieve horizontal partitioning?
>
>
>
> Thanks!
>
> Olivier.

Re: Horizontal partitioning?

Posted by David Van Couvering <da...@vancouvering.com>.

Hm, if you're creating a view on all the tables, that must mean all
the tables are under a single database?  If that's the case, then this
doesn't really sound like horizontal partitioning (which usually means
splitting it into multiple databases, each with its own disk and
usually each with its own CPU).

If you're keeping everything within one database, then partitioning
across multiple tables really is not useful or necessary, as Derby
itself accomplishes quick access to any row in the table using the
index on the primary key.

Normally, if you have N partitions, then you would have N databases on
N machines.  There would be no way to create a view across all of
them.

This is how folks like eBay and Google do it, for example.  This
usually only works if the data in each partition is completely or
almost completely independent and you don't need to do queries that
span tables (e.g. you can issue the query against one specific
partition depending on things like timestamp, user id, geographic
location, etc).  If your queries can't be isolated to a single
partition, then your application code is going to have to deal with
logic normally assigned to the database, such as sorting and merging
data, and now you're in the realm of a clustered database, and having
worked on two of these in the past, I think I can safely say you don't
want to go there...

David

On 7/16/07, Olivier Chedru <oC...@infovista.com> wrote:
>
>
>
>
> Hi all,
>
>
>
> I am trying to achieve horizontal partitioning with Derby.
>
>
>
> My application stores *a lot* of data as time passes, and it needs to drop
> obsolete data.
>
>
>
> Table partitioning would be the ideal solution, but it is not implemented in
> Derby. So I create one table per period of time. All tables have the same
> columns, the first one being the data timestamp. Dropping obsolete data is
> then made easy: just drop old tables.
>
>
>
> Now, I need to query data without knowing how tables are organized. So, I
> create a view on all tables (the view is deleted/created every time a data
> table is created/dropped): fine.
>
> However, checking the runtime statistics, I notice ALL tables are scanned
> when I use the view! Note I added constraints on the Timestamp field.
>
>
>
> Sample SQL:
>
>
>
> create table t1 (ts timestamp, a int, b int, c int);
>
> create table t2 (ts timestamp, a int, b int, c int);
>
> create table t3 (ts timestamp, a int, b int, c int);
>
> alter table t1 add constraint t1_c check (ts >= '2007-07-16 09:00:00.0' and
> ts < '2007-07-16 09:01:00.0');
>
> alter table t2 add constraint t2_c check (ts >= '2007-07-16 09:01:00.0' and
> ts < '2007-07-16 09:02:00.0');
>
> alter table t3 add constraint t3_c check (ts >= '2007-07-16 09:02:00.0' and
> ts < '2007-07-16 09:03:00.0');
>
> create view t_view as select * from t1 union select * from t2 union select *
> from t3;
>
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
>
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
>
> select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> '2007-07-16 09:02:00.0';
>
> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
>
>
>
> The displayed execution plan shows t1, t2 and t3 are scanned. I expect only
> t2 is used. I also tried to force using constraints by defining the view
> this way, but it did not improve the behavior:
>
>
>
> create view t_view as select * from t1 --DERBY-PROPERTIES constraint=t1_c
> union select * from t2 --DERBY-PROPERTIES constraint=t2_c union select *
> from t3 --DERBY-PROPERTIES constraint=t3_c;
>
>
>
> Any idea on how to achieve horizontal partitioning?
>
>
>
> Thanks!
>
> Olivier.