You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@impala.apache.org by Oleksandr Baliev <al...@gmail.com> on 2018/03/10 22:12:42 UTC

Ways to synch different data sources in Impala

Hello,

I have a big table with raw numbers and ids in impala. When I generate some
reports I have bunch of ids for which I need to show some (from 100 to
100.000 it depends on query) corresponding names stored in some another
table in some RDBMS. So I have raw ids with some key figures for it in
Impala and user friendly names for this ids in RDBS, and i need to show
somehow all this info.

So I need to synch 2 data sources on query level and show user friendly
info with all aggregated key figures. Could you please give some advises,
what is the preferable way to deal with such tasks with Impala (and
probably it's more related for big data area in general)?

Few ways I see:
1. One way is to synch somehow RDBMS with Impala (question also how, but
can be resolved I guess with using some key based storage like Kudu or
Hbase), do some joins on Impala side (since subqueries are not supported in
select) and show in the end id + corresponding name. But joins could slow
down everything, not sure yet, want to make different test with all the
data of course, but maybe some of you are already tried that.

2. Another way is to get the report from Impala, resolve from RDBMS on some
application level -> too slow as for me, i need different reports and
flexible way to create them. But maybe it's possible to export to some
third party stack, like BI tools, but I need more real time queries, and
from my not big experience, they BI tools are good when they compile all
that data and can good visualize it, but that' it. Maybe some of you are
already using something reliable and flexible, would be interested to hear
your experience

3. Store id with name from the beginning in big table, use id for
aggregation, name for viewing. Here the question is, do you thing it's a
good approach? Won't performance be slowed down for any reasons or maybe
there is another pitfalls.

I would really appreciate if you could share your experience with such kind
of tasks.

with kind regards,
Oleksandr Baliev

Re: Ways to synch different data sources in Impala

Posted by Oleksandr Baliev <al...@gmail.com>.
Thanks Thomas for you reply, at least I know that other people are thinking
in more or less the same way : )

> This is a reasonable approach and Impala has good support for fast joins.
The main concern would be figuring out how to perform the sync, which will
depend on your use case, and the possibility of data being duplicated
between Impala and the other RDBMS, which could require more storage.

I will definitely look again into joins, I've made already few tests, one
partitioned parquet table and kind of big report could join around 6 id
fields without dropping much in performance, but it was just a playground,
was not sure if i want to proceed further, but seems will go for it in my
case.

One way to sync tables in Impala and RDBMS I saw in using just
Hive/Impala/Hbase integration. For this wanted to use kind of
https://www.confluent.io/product/connectors/ with some SQL connector,
especially in our case when we are already tracking changes for most
important RDBMS tables in some separate aud tables (hibernate envers) and
then use HBase sink which will constantly (1 minute delay is okay for us)
put changes. And from some tests I've discovered, whenever data is in
HBase, it's queriable from Impala and won't break existing queries, which
is cool, not sure about joins performance with Hbase and bigger amount of
data though. Of course there are also key design restriction but for my
case i will need id -> name tables only. But also that can be not very
scalable solution since in such cases in future I would need to create too
much such tables (maybe even not too much) and can meet some HBase
restrictions, as far as i remember something with amount of Regions per
table or something.

best,
Oleksandr


2018-03-14 0:24 GMT+01:00 Thomas Tauber-Marshall <tm...@cloudera.com>:

> The three ideas you outline are probably your best options, and which one
> you want to choose will just depend on your particular needs. It sounds
> like your on the right track, though. Comments about each are inline below:
>
> On Sat, Mar 10, 2018 at 2:12 PM Oleksandr Baliev <
> aleksanderbalev@gmail.com> wrote:
>
>> Hello,
>>
>> I have a big table with raw numbers and ids in impala. When I generate
>> some reports I have bunch of ids for which I need to show some (from 100 to
>> 100.000 it depends on query) corresponding names stored in some another
>> table in some RDBMS. So I have raw ids with some key figures for it in
>> Impala and user friendly names for this ids in RDBS, and i need to show
>> somehow all this info.
>>
>> So I need to synch 2 data sources on query level and show user friendly
>> info with all aggregated key figures. Could you please give some advises,
>> what is the preferable way to deal with such tasks with Impala (and
>> probably it's more related for big data area in general)?
>>
>> Few ways I see:
>> 1. One way is to synch somehow RDBMS with Impala (question also how, but
>> can be resolved I guess with using some key based storage like Kudu or
>> Hbase), do some joins on Impala side (since subqueries are not supported in
>> select) and show in the end id + corresponding name. But joins could slow
>> down everything, not sure yet, want to make different test with all the
>> data of course, but maybe some of you are already tried that.
>>
>
> This is a reasonable approach and Impala has good support for fast joins.
> The main concern would be figuring out how to perform the sync, which will
> depend on your use case, and the possibility of data being duplicated
> between Impala and the other RDBMS, which could require more storage.
>
>
>>
>> 2. Another way is to get the report from Impala, resolve from RDBMS on
>> some application level -> too slow as for me, i need different reports and
>> flexible way to create them. But maybe it's possible to export to some
>> third party stack, like BI tools, but I need more real time queries, and
>> from my not big experience, they BI tools are good when they compile all
>> that data and can good visualize it, but that' it. Maybe some of you are
>> already using something reliable and flexible, would be interested to hear
>> your experience
>>
>
> As you say, this approach is likely to be the slowest and least flexible,
> since you would have to write any sort of join logic that you need
> yourself. This would be necessary, though, if there are some requirements
> for your use case that prevent you from importing the data from the other
> RDBMS into Impala, as the other two options require.
>
>
>>
>> 3. Store id with name from the beginning in big table, use id for
>> aggregation, name for viewing. Here the question is, do you thing it's a
>> good approach? Won't performance be slowed down for any reasons or maybe
>> there is another pitfalls.
>>
>
> This option will probably give you the best performance in terms of query
> running time, depending on your setup.
> - One issue here is that by adding more columns to your table, you may
> increase the amount of data that has to be scanned for some queries, for
> example if you have a query that only accesses the ids but not the names.
> This can be avoided by using a columnar format such as parquet, so that
> Impala will only scan the columns that are needed.
> - Another possible problem is if this leads to duplicated data within
> Impala, eg. if the names are now stored in two different tables when
> previously they were only in one. Whether or not the extra storage needed
> is manageable will depend on your workload.
>
>
>>
>> I would really appreciate if you could share your experience with such
>> kind of tasks.
>>
>> with kind regards,
>> Oleksandr Baliev
>>
>>
>>
>>
>>

Re: Ways to synch different data sources in Impala

Posted by Thomas Tauber-Marshall <tm...@cloudera.com>.
The three ideas you outline are probably your best options, and which one
you want to choose will just depend on your particular needs. It sounds
like your on the right track, though. Comments about each are inline below:

On Sat, Mar 10, 2018 at 2:12 PM Oleksandr Baliev <al...@gmail.com>
wrote:

> Hello,
>
> I have a big table with raw numbers and ids in impala. When I generate
> some reports I have bunch of ids for which I need to show some (from 100 to
> 100.000 it depends on query) corresponding names stored in some another
> table in some RDBMS. So I have raw ids with some key figures for it in
> Impala and user friendly names for this ids in RDBS, and i need to show
> somehow all this info.
>
> So I need to synch 2 data sources on query level and show user friendly
> info with all aggregated key figures. Could you please give some advises,
> what is the preferable way to deal with such tasks with Impala (and
> probably it's more related for big data area in general)?
>
> Few ways I see:
> 1. One way is to synch somehow RDBMS with Impala (question also how, but
> can be resolved I guess with using some key based storage like Kudu or
> Hbase), do some joins on Impala side (since subqueries are not supported in
> select) and show in the end id + corresponding name. But joins could slow
> down everything, not sure yet, want to make different test with all the
> data of course, but maybe some of you are already tried that.
>

This is a reasonable approach and Impala has good support for fast joins.
The main concern would be figuring out how to perform the sync, which will
depend on your use case, and the possibility of data being duplicated
between Impala and the other RDBMS, which could require more storage.


>
> 2. Another way is to get the report from Impala, resolve from RDBMS on
> some application level -> too slow as for me, i need different reports and
> flexible way to create them. But maybe it's possible to export to some
> third party stack, like BI tools, but I need more real time queries, and
> from my not big experience, they BI tools are good when they compile all
> that data and can good visualize it, but that' it. Maybe some of you are
> already using something reliable and flexible, would be interested to hear
> your experience
>

As you say, this approach is likely to be the slowest and least flexible,
since you would have to write any sort of join logic that you need
yourself. This would be necessary, though, if there are some requirements
for your use case that prevent you from importing the data from the other
RDBMS into Impala, as the other two options require.


>
> 3. Store id with name from the beginning in big table, use id for
> aggregation, name for viewing. Here the question is, do you thing it's a
> good approach? Won't performance be slowed down for any reasons or maybe
> there is another pitfalls.
>

This option will probably give you the best performance in terms of query
running time, depending on your setup.
- One issue here is that by adding more columns to your table, you may
increase the amount of data that has to be scanned for some queries, for
example if you have a query that only accesses the ids but not the names.
This can be avoided by using a columnar format such as parquet, so that
Impala will only scan the columns that are needed.
- Another possible problem is if this leads to duplicated data within
Impala, eg. if the names are now stored in two different tables when
previously they were only in one. Whether or not the extra storage needed
is manageable will depend on your workload.


>
> I would really appreciate if you could share your experience with such
> kind of tasks.
>
> with kind regards,
> Oleksandr Baliev
>
>
>
>
>