Posted to dev@beam.apache.org by Madhusudan Borkar <mb...@etouch.net> on 2017/06/10 19:28:06 UTC

[PROPOSAL] for AWS Aurora relational database connector

Hi,
We are proposing to develop a connector for AWS Aurora. Aurora, being a clustered
relational database (MySQL-compatible), has no Java API for reading/writing other
than the JDBC client. Although JdbcIO is available, it does not appear to read
in parallel. The proposal is to provide split functionality and then use a
transform to parallelize the operation. As mentioned above, this is a typical
SQL-based database and not comparable with the likes of Hive. The Hive
implementation is based on an abstraction over Hadoop's HDFS file system, which
provides splits; none of that is applicable here.
During the implementation of the Hive connector there was a lot of discussion
about how to implement a connector while strictly following Beam design
principles using a bounded source. I am not sure how an Aurora connector will
fit into these design principles.
Here is our proposal.
1. Split functionality: if the table contains 'x' rows, it will be split
into 'n' bundles in the split method, roughly as follows (see the sketch after
this list): noOfSplits = 'x' * average size of a single row / bundleSize hint
from the runner.
2. Each of these 'pseudo' splits would then be read in parallel.
3. Each of these reads will use a DB connection from a connection pool.
This will also allow better benchmarking. Please let us know your views.
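A minimal sketch of the split arithmetic in step 1, in plain Java. The helper
below only computes (offset, count) ranges; the row count and average row size
are assumed to come from the database (see later in the thread), and the class
and method names are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    class SplitMath {
      // Computes (offset, count) pairs for the 'pseudo' splits described above.
      // Each pair maps onto one "limit offset, count" query.
      static List<long[]> computeSplits(long totalRows, long avgRowSizeBytes,
                                        long desiredBundleSizeBytes) {
        long totalBytes = totalRows * avgRowSizeBytes;
        int numSplits = (int) Math.max(1, totalBytes / desiredBundleSizeBytes);
        long rowsPerSplit = (totalRows + numSplits - 1) / numSplits;  // ceiling division
        List<long[]> ranges = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += rowsPerSplit) {
          ranges.add(new long[] {offset, Math.min(rowsPerSplit, totalRows - offset)});
        }
        return ranges;
      }
    }

Each range would then be read on its own pooled connection, as in steps 2 and 3.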

Thanks
Madhu Borkar

Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
...Final note: performance when executing queries "limit A, B" and "limit
C, D" in sequence may be completely different from executing them in
parallel. In particular, if they are run in parallel, most likely much less
caching will happen. Make sure your benchmarks account for this too.
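To illustrate, a benchmark could time the same set of ranged queries once
sequentially and once concurrently. A rough sketch (the connection details are
placeholders, not taken from this thread):

    import java.sql.*;
    import java.util.List;
    import java.util.concurrent.*;
    import java.util.stream.Collectors;

    class RangedQueryBenchmark {
      static final String URL = "jdbc:mysql://aurora-host:3306/mydb";  // hypothetical
      static final String USER = "user", PASS = "pass";                // hypothetical

      // Runs the given queries with the given parallelism and returns the elapsed
      // wall-clock milliseconds. Compare threads = 1 with threads = queries.size().
      static long run(List<String> queries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        List<Callable<Long>> tasks = queries.stream()
            .map(q -> (Callable<Long>) () -> {
              try (Connection c = DriverManager.getConnection(URL, USER, PASS);
                   Statement s = c.createStatement();
                   ResultSet rs = s.executeQuery(q)) {
                long rows = 0;
                while (rs.next()) rows++;  // drain the result set
                return rows;
              }
            })
            .collect(Collectors.toList());
        for (Future<Long> f : pool.invokeAll(tasks)) f.get();  // propagate failures
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
      }
    }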


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Most likely the identical performance you observed for the "limit" clause is
because you are not sorting the rows. Without sorting, a "limit" query is
meaningless: the database is technically allowed to return exactly the same
result for "limit 0, 10" and "limit 10, 10", because both of these queries
are in essence asking for 10 arbitrary rows from the table.
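To make the contrast concrete, a hedged example (the table and column names are
made up; the syntax is MySQL-style "limit offset, count"): only the ordered
variants partition the table into well-defined ranges, and even that holds only
while the table is not being modified, as discussed elsewhere in the thread.

    class RangeQueries {
      // Without ORDER BY the database may return any rows for either range:
      static final String PART_1 = "SELECT * FROM my_table LIMIT 0, 10";   // 10 arbitrary rows
      static final String PART_2 = "SELECT * FROM my_table LIMIT 10, 10";  // may overlap PART_1

      // With ORDER BY on a unique key the two ranges are disjoint and complete
      // (for a table that is not being modified concurrently):
      static final String PART_1_ORDERED = "SELECT * FROM my_table ORDER BY id LIMIT 0, 10";
      static final String PART_2_ORDERED = "SELECT * FROM my_table ORDER BY id LIMIT 10, 10";
    }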


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Thanks Madhusudan. Please note that in your case, likely, the time was
dominated by shipping the rows over the network, rather than executing the
query. Please make sure to include benchmarks where the query itself is
expensive to evaluate (e.g. "select count(*) from query" takes time
comparable to "select * from query"), and make sure that its results are
large enough to be non-cacheable.

Also let me know your thoughts on ensuring *semantic* correctness - i.e.
the fact that "select * limit n, m" is non-deterministic and requires an
ORDER BY clause to provide any kind of determinism - and that even that
determinism is lost if the table may be concurrently modified.
E.g. you have a table with values in column "id": 0, 10, 20, 30, 40, 50.
You split it into 2 parts - "order by id limit 0, 3" and "order by id limit
3, 3".
While you read the "limit 0, 3" part, the value 5 is added, and this part
gets values 0, 5, 10.
When you read the "limit 3, 3" part, the value 7 is added, and this part
gets values 10, 20, 30.
As a result, you get the value 10 twice, and you completely miss the
values 7, 40 and 50.

The only way to address this is if the database supports snapshot
(point-in-time) reads. If Aurora supports these and you plan to use them,
then make sure that you use this feature in all of your benchmarks - it
might be less optimized by Aurora internally.
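A sketch of how a single split could be read within a consistent snapshot over
plain JDBC, assuming Aurora MySQL behaves like InnoDB, where a REPEATABLE READ
transaction sees one snapshot for all its statements (connection details and
table/column names are placeholders):

    import java.sql.*;

    class SnapshotRangeRead {
      static void readRange(String url, String user, String pass,
                            long offset, long count) throws SQLException {
        try (Connection c = DriverManager.getConnection(url, user, pass)) {
          // Assumption: under REPEATABLE READ, all statements in the transaction
          // read from a single snapshot, as in stock InnoDB.
          c.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
          c.setAutoCommit(false);
          try (PreparedStatement ps =
                   c.prepareStatement("SELECT * FROM my_table ORDER BY id LIMIT ?, ?")) {
            ps.setLong(1, offset);
            ps.setLong(2, count);
            try (ResultSet rs = ps.executeQuery()) {
              while (rs.next()) {
                // hand the row to the pipeline here
              }
            }
          }
          c.commit();
        }
      }
    }

Note that each worker opening its own transaction gets its own snapshot, so this
alone does not give all splits a view of the same point in time; that would need
an engine-level way to read as of a shared timestamp.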


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Madhusudan Borkar <mb...@etouch.net>.
Hi,
We appreciate your questions.
One thing I believe: even though AWS Aurora is based on MySQL, it is not
MySQL. AWS has built this database service (offered through RDS) from the
ground up and has improved or completely changed its implementation, so some
of the behavior one may have seen with a standard MySQL installation will not
hold. They have changed the way query results are cached (a new technique),
and there are multiple read replicas which are load balanced and separated
from the writer. There are many more improvements.
The query to read m rows starting at row n looks like:
select * from table limit n, m;
My research with Aurora suggests that this does not read up to n-1 rows first
and then discard that data before picking up the next m rows; it appears to
start directly from the nth row. My table with 2M rows returned data for 200K
rows in under 15 seconds. If I repeat the query, it takes less than 5 seconds.
I am not sorting the rows.
We can get the average row size by querying the table's metadata in
information_schema.tables, without even touching the table itself (see the
sketch below).
Please give us more time to provide more on benchmarking.
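A sketch of that metadata lookup: MySQL's information_schema.tables exposes
TABLE_ROWS and AVG_ROW_LENGTH (for InnoDB these are estimates, which should be
good enough for sizing splits). Connection handling is omitted and the
schema/table names are parameters:

    import java.sql.*;

    class TableStats {
      // Returns {approximate row count, average row length in bytes} for a table.
      static long[] rowCountAndAvgSize(Connection c, String schema, String table)
          throws SQLException {
        String sql = "SELECT TABLE_ROWS, AVG_ROW_LENGTH FROM information_schema.tables"
                   + " WHERE table_schema = ? AND table_name = ?";
        try (PreparedStatement ps = c.prepareStatement(sql)) {
          ps.setString(1, schema);
          ps.setString(2, table);
          try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            return new long[] {rs.getLong("TABLE_ROWS"), rs.getLong("AVG_ROW_LENGTH")};
          }
        }
      }
    }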


Madhu Borkar


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
To elaborate a bit on what JB said:

Suppose the table has 1,000,000 rows, and suppose you split it into 1000
bundles, 1000 rows per bundle.

Does Aurora provide an API that allows you to efficiently read the bundle
containing rows 999,001-1,000,000 without reading and throwing away the
first 999,000 rows?
Because if the answer to this is "no", then reading these 1000 bundles would
involve scanning a total of 1000+2000+...+1,000,000 = 500,500,000 rows (and
throwing away 499,500,000 of them) instead of 1,000,000.

For a typical relational database, the answer is "no" - that's why JdbcIO
does not provide splitting. Instead it reads the rows sequentially, but
uses a reshuffle-like transform to make sure that they will be *processed*
in parallel by downstream transforms (a usage sketch follows below).

There are also a couple more questions that make this proposal much harder
than it may seem at first sight:
- In order to make sure the bundles cover each row exactly once, you need
to scan the table in a particular fixed order - otherwise "rows number X
through Y" is meaningless - and this adds the overhead of an ORDER BY clause
(though for a table with a primary key it's probably negligible).
- If the table is changed, and rows are inserted and deleted while you read
it, then again "rows number X through Y" is a meaningless concept, because
what is "row number X" at one moment may be completely different at another
moment, and from reading 1000 bundles in parallel you might get duplicate
rows, lost rows, or both.
- You mention the "size of a single row" - I suppose you're referring to
the arithmetic mean of the sizes of all rows in the table. Does Aurora
provide a way to efficiently query for that? (without reading the whole
table and computing the size of each row)
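For reference, this is roughly how JdbcIO is used today: the query is read
sequentially and parallelism comes from downstream processing, not from
splitting the read. The connection details and row type below are placeholders;
the JdbcIO calls are from the Beam Java SDK, but check them against the version
you build with:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class JdbcReadExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The read itself is sequential; the rows are then redistributed so that
        // downstream transforms can process them in parallel.
        PCollection<String> names = p.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.jdbc.Driver", "jdbc:mysql://aurora-host:3306/mydb")
                        .withUsername("user")
                        .withPassword("pass"))
                .withQuery("SELECT name FROM my_table")
                .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1))
                .withCoder(StringUtf8Coder.of()));

        p.run().waitUntilFinish();
      }
    }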


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Sourabh Bajaj <so...@google.com.INVALID>.
+1 for S3 being more of a filesystem.

@Madhusudan, can you point to some documentation on how to do row-range
queries in Aurora? From a quick scan it follows the MySQL 5.6 syntax, so
you will still need an ORDER BY for the IO to do exactly-once reads. I
wanted to learn more about how the questions raised by Eugene are handled.

Thanks
Sourabh


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

I think it's a mix of filesystem and IO. For S3, I see it more as a Beam
filesystem than a pure IO.

WDYT?

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by tarush grover <ta...@gmail.com>.
Hi All,

I think this can be added under java --> io --> aws-cloud-platform, and
more IO connectors can be added into it, e.g. S3 as well.

Regards,
Tarush


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Madhusudan Borkar <mb...@etouch.net>.
Yes, I believe so. Thanks for the Jira.

Madhu Borkar


Re: [PROPOSAL] for AWS Aurora relational database connector

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

I created a Jira to add custom splitting to JdbcIO (but it's not so trivial,
depending on the backend).

Regarding your proposal, it sounds interesting, but do you think we will really
have "parallel" reads of the splits? I think splitting only makes sense if we
can do parallel reads: if we split but still read from a single backend, it
doesn't bring much improvement.
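One way the split reads could land on more than one backend is Aurora's cluster
reader endpoint, which load-balances new connections across the read replicas
mentioned earlier in the thread. A hedged sketch, with the endpoint,
credentials, and table made up; each split would use its own connection, as in
step 3 of the proposal:

    import java.sql.*;

    class ReaderEndpointRead {
      // Hypothetical Aurora reader endpoint; new connections are spread across
      // read replicas, so concurrent split reads can hit different nodes.
      static final String READER_URL =
          "jdbc:mysql://my-cluster.cluster-ro-example.us-west-2.rds.amazonaws.com:3306/mydb";

      static void readSplit(long offset, long count) throws SQLException {
        try (Connection c = DriverManager.getConnection(READER_URL, "user", "pass");
             PreparedStatement ps =
                 c.prepareStatement("SELECT * FROM my_table ORDER BY id LIMIT ?, ?")) {
          ps.setLong(1, offset);
          ps.setLong(2, count);
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              // emit the row
            }
          }
        }
      }
    }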

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com