Posted to dev@drill.apache.org by Charles Givre <cg...@gmail.com> on 2019/12/04 03:06:01 UTC

Re: [DRILL with ALTERYX]

Hi Thiago, 
Welcome to the Drill community!  I'd be happy to help you out, and from what you're describing, Drill may be a great tool for this use case.  Can you share a bit more about what kinds of systems you are looking to query with Drill?  I assume you've seen the documentation at drill.apache.org <http://drill.apache.org/>?  I'll also put in a shameless plug for the Drill book, which might be useful. [1]

Best,
-- C

[1] https://amzn.to/33P2QwC



> On Dec 3, 2019, at 9:54 PM, Thiago Samuel dos Santos Ribeiro <th...@stitdata.com> wrote:
> 
> Hi Apache Team,
>  
> Here in Brazil, we are evaluating a very large solution using Apache Drill at a telco company. The solution must connect to several data sources, join them, and bring the data back to Alteryx; in this case Drill would work as a data abstraction layer.
>  
> I am worried that the Drill community does not have enough information or use cases that we can take advantage of to drive our project here.
>  
> Could someone please point me to a person or to community documentation about this approach?
>  
>  <https://www.linkedin.com/in/thiago-ribeiro-45b6227>

Re: [DRILL with ALTERYX]

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Thiago,

Just wanted to follow up with a bit more detail.

The use case you describe is what is sometimes called "query integration": having a single tool accept a query, then turn around and issue other queries to other data sources. Finally, the query integrator combines the resulting data. Drill has some of this functionality, depending on the data sources you want to use.


You'd use Drill's JDBC or ODBC driver to connect Alteryx to Drill, so that Alteryx can send queries to Drill and get the results back.
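
As a rough illustration of the connection step, the sketch below builds the two JDBC URL forms that Drill's documentation describes: a direct connection to one Drillbit, or a connection through ZooKeeper. The helper name drill_jdbc_url and the example host names are assumptions made up for this illustration, and the defaults shown (ZooKeeper root "drill", cluster ID "drillbits1") should be checked against your own installation. How Alteryx consumes the URL or an ODBC DSN is not shown.

```python
def drill_jdbc_url(hosts, use_zookeeper=False, zk_root="drill", cluster_id="drillbits1"):
    """Build a Drill JDBC URL (hypothetical helper, for illustration only).

    hosts: list of "host:port" strings. With use_zookeeper=True they are
    ZooKeeper nodes; otherwise only the first entry is used as the Drillbit.
    """
    if not hosts:
        raise ValueError("at least one host:port is required")
    if use_zookeeper:
        # Form: jdbc:drill:zk=<zk1>:<port>,<zk2>:<port>/<zk root>/<cluster ID>
        return f"jdbc:drill:zk={','.join(hosts)}/{zk_root}/{cluster_id}"
    # Form: jdbc:drill:drillbit=<host>:<port>
    return f"jdbc:drill:drillbit={hosts[0]}"


# Example host names below are placeholders, not real servers.
direct_url = drill_jdbc_url(["drill1.example.com:31010"])
zk_url = drill_jdbc_url(["zk1.example.com:2181", "zk2.example.com:2181"], use_zookeeper=True)
```

Going through ZooKeeper lets the driver pick any available Drillbit, which is usually preferable for a cluster; the direct form is handy for a single embedded or test Drillbit.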


Although Drill can connect to many data sources, query integration has not been the primary use case for Drill historically. Drill has mostly focused on reading tables in HDFS, MFS, S3 and other distributed file systems.


Query integration rapidly becomes complex: one must decide whether it is better to, say, scan both DBs A and B, or scan DB A and do per-row lookups in B, or perhaps vice versa.
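
To make the two strategies concrete, here is a minimal sketch that uses two in-memory SQLite databases as stand-ins for DB A and DB B. This is not how Drill executes such a query; the table names, columns, and function names are all hypothetical. It only shows that "scan both and join" and "scan A, look up each row in B" produce the same answer while issuing very different numbers of queries.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# DB A: call records. DB B: customer master data. (Hypothetical schemas.)
db_a = make_source(
    "CREATE TABLE calls (customer_id INTEGER, minutes INTEGER)",
    "INSERT INTO calls VALUES (?, ?)",
    [(1, 10), (2, 25), (1, 5), (3, 40)],
)
db_b = make_source(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ana"), (2, "Bruno"), (3, "Carla"), (4, "Davi")],
)


def scan_both_and_join(a, b):
    """Strategy 1: scan A and B in full, then hash-join inside the integrator."""
    names = dict(b.execute("SELECT customer_id, name FROM customers"))
    calls = a.execute("SELECT customer_id, minutes FROM calls ORDER BY rowid")
    return [(names[cid], minutes) for cid, minutes in calls if cid in names]


def scan_a_lookup_b(a, b):
    """Strategy 2: scan A, then issue one keyed lookup against B per row of A."""
    result = []
    calls = a.execute("SELECT customer_id, minutes FROM calls ORDER BY rowid").fetchall()
    for cid, minutes in calls:
        row = b.execute(
            "SELECT name FROM customers WHERE customer_id = ?", (cid,)
        ).fetchone()
        if row is not None:
            result.append((row[0], minutes))
    return result


joined = scan_both_and_join(db_a, db_b)
looked_up = scan_a_lookup_b(db_a, db_b)
```

Strategy 1 sends two queries but pulls every row of B across the wire; strategy 2 transfers only matching rows of B but pays one round trip per row of A. Which one wins depends on the sizes of A and B and on the per-lookup latency.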

As it turns out, Drill uses Apache Calcite for query planning. One could add Calcite rules to help decide how best to divide up a query. You would need some statistics about your data source, such as the number of rows expected from a query to a DB. Getting these numbers right for each data source can be tricky. Still, if you've read about the existing data sources, you'll see the community has integrated with Kafka, HBase, MapRDB and more.
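
To illustrate why those statistics matter, here is a toy cost comparison in Python. It is not Calcite's API and not Drill's cost model: the SourceStats class, the plan names, and the cost constants are all assumptions invented for this sketch. The point is only that the plan choice flips depending on the row-count estimates you feed in.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Assumed per-source statistics (hypothetical, for illustration)."""
    rows: int                 # estimated rows the sub-query returns
    scan_cost_per_row: float  # assumed cost to stream one row from a full scan
    lookup_cost: float        # assumed cost of one keyed lookup (a round trip)


def plan_costs(a: SourceStats, b: SourceStats) -> dict:
    """Estimate the cost of each candidate plan from the statistics."""
    return {
        "scan_both": a.rows * a.scan_cost_per_row + b.rows * b.scan_cost_per_row,
        "scan_a_lookup_b": a.rows * a.scan_cost_per_row + a.rows * b.lookup_cost,
        "scan_b_lookup_a": b.rows * b.scan_cost_per_row + b.rows * a.lookup_cost,
    }


def choose_plan(a: SourceStats, b: SourceStats) -> str:
    """Pick the plan with the lowest estimated cost."""
    costs = plan_costs(a, b)
    return min(costs, key=costs.get)


# Small A against a huge B: per-row lookups into B avoid scanning B.
small_a = SourceStats(rows=100, scan_cost_per_row=1.0, lookup_cost=50.0)
huge_b = SourceStats(rows=1_000_000, scan_cost_per_row=1.0, lookup_cost=50.0)
plan_skewed = choose_plan(small_a, huge_b)

# Two sources of similar size: scanning both is cheaper than many lookups.
mid_a = SourceStats(rows=1_000, scan_cost_per_row=1.0, lookup_cost=50.0)
mid_b = SourceStats(rows=1_000, scan_cost_per_row=1.0, lookup_cost=50.0)
plan_balanced = choose_plan(mid_a, mid_b)
```

If the row estimate for A were off by a few orders of magnitude, the same function would pick the wrong plan, which is why getting the numbers right for each source is the hard part.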

This kind of cross-DB planning exists in Drill in only the most rudimentary form. We'd welcome contributions to build on Calcite to expand this functionality.


You mention a data abstraction layer. In addition to just combining queries, such layers also handle type conversions. Maybe a product code is an INT in system A, but a VARCHAR in system B. Maybe names are stored as a single string in system B, but as separate first and last names in system C. Tools exist to handle this complexity, but Drill does not do so directly. You can create views that handle normalization, but you might need a tool that handles data unification if the differences between data models are significant. (Such a tool could be built on Drill, but I don't know of anyone who has yet done so.)
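
As a small sketch of the "views that handle normalization" idea, the example below uses Python's built-in sqlite3 module as a stand-in for Drill. The tables, columns, and view name are hypothetical, and for brevity both mismatches (INT vs. text product code, split vs. single-string name) are placed in just two systems. In Drill you would write a similar CREATE VIEW over your storage plugins, but the exact syntax and type names (for example VARCHAR rather than TEXT) will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- System A: product code is an integer, name is split into two columns.
    CREATE TABLE sys_a (product_code INTEGER, first_name TEXT, last_name TEXT);
    -- System B: product code is text, name is a single string.
    CREATE TABLE sys_b (product_code TEXT, full_name TEXT);

    INSERT INTO sys_a VALUES (1001, 'Ana', 'Silva');
    INSERT INTO sys_b VALUES ('1002', 'Bruno Costa');

    -- The view presents one normalized shape over both systems.
    CREATE VIEW customers_unified AS
        SELECT CAST(product_code AS TEXT)      AS product_code,
               first_name || ' ' || last_name  AS full_name,
               'A'                             AS source_system
        FROM sys_a
        UNION ALL
        SELECT product_code, full_name, 'B'
        FROM sys_b;
    """
)

unified_rows = conn.execute(
    "SELECT product_code, full_name, source_system "
    "FROM customers_unified ORDER BY product_code"
).fetchall()
```

A view like this covers simple casts and concatenations. Once the mapping needs lookups, fuzzy matching, or conflict resolution between systems, you are into the data-unification territory mentioned above.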

Can you explain a bit more about your use case so we can make better suggestions? For example, how many data sources do you need to query? How similar are the data models?

Thanks,
- Paul

 

    On Tuesday, December 3, 2019, 7:06:15 PM PST, Charles Givre <cg...@gmail.com> wrote:  
 
 Hi Thiago, 
Welcome to the Drill community!  I'd be happy to help you out, and from what you're describing, Drill may be a great tool for this use case.  Can you share a bit more about what kinds of systems you are looking to query with Drill?  I assume you've seen the documentation at drill.apache.org <http://drill.apache.org/>?  I'll also put in a shameless plug for the Drill book, which might be useful. [1]

Best,
-- C

[1] https://amzn.to/33P2QwC



> On Dec 3, 2019, at 9:54 PM, Thiago Samuel dos Santos Ribeiro <th...@stitdata.com> wrote:
> 
> Hi Apache Team,
>  
> Here in Brazil, we are evaluating a very large solution using Apache Drill at a telco company. The solution must connect to several data sources, join them, and bring the data back to Alteryx; in this case Drill would work as a data abstraction layer.
>  
> I am worried that the Drill community does not have enough information or use cases that we can take advantage of to drive our project here.
>  
> Could someone please point me to a person or to community documentation about this approach?
>  
>  <https://www.linkedin.com/in/thiago-ribeiro-45b6227>  
