Posted to user@drill.apache.org by Jiang Wu <ji...@mulesoft.com.INVALID> on 2020/01/17 00:46:18 UTC

Non-Relational Data

Moving the topic of non-relational data to this dedicated thread.  First, a
bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. an API returning complex
object collections.
* Data may change structure over time, e.g. due to implementation upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as
mapping to a relational schema or running custom code in a distributed compute
cluster (map-reduce, Spark jobs, etc.) on blob data.  These have their
drawbacks: data latency and the effort of structure transformation for the
former, and query latency and compute cost on blob data for the latter.

We built a columnar data store for non-relational data without a pre-defined
schema.  For querying this data, technologies like Drill made it almost
possible to work directly with non-relational data using array and map data
types.  However, we feel more can be done to truly make non-relational data
a first-class citizen:

1) functions on arrays and maps -- e.g. sizeOf(person.addresses) where
person.addresses is an array.  Using FLATTEN is not the same as working
with complex objects directly (see the sketch below this list).
2) heterogeneous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some are
strings.  Treating every value as a String is only a workaround.
3) better storage plugin support for complex types -- we had to regenerate
objects from our columnar vectors to hand to Drill, rather than feeding the
vectors in directly.
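
To make item 1 concrete, here is a rough sketch of the kind of query we would
like to write versus today's FLATTEN workaround.  sizeOf is hypothetical
(Drill's built-in repeated_count covers only this simple count case), and the
file path is just an example:

    -- Desired: operate on the array in place (sizeOf is hypothetical)
    SELECT t.name, sizeOf(t.person.addresses) AS num_addresses
    FROM dfs.tmp.`people.json` t;

    -- Today: FLATTEN explodes the array into one row per element, so we
    -- have to re-aggregate to get back to one row per person
    SELECT f.name, COUNT(*) AS num_addresses
    FROM (SELECT t.name, FLATTEN(t.person.addresses) AS addr
          FROM dfs.tmp.`people.json` t) f
    GROUP BY f.name;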

I don't think any of these are easy to do.  Much research and thinking will
be needed for a cohesive solution.

-- Jiang

Re: Non-Relational Data

Posted by Jiang Wu <ji...@mulesoft.com.INVALID>.
Hi Paul.  Thank you for all the analysis.  Let me digest the information
and I will come back with my views.

— Jiang


Re: Non-Relational Data

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

One more follow up. Perhaps there are two cases here: 1) "normal" columns which have evolved over time from one type to another, and 2) data which is inherently variable and non-relational.

One can make the case that the power/convenience of SQL comes from having a known set of types for "normal" columns, such as the product version example we discussed. The "data lake guys" use ETL (usually into Parquet) and HMS schemas to enforce and define the types. Clearly works for the data lake use case. Impala & Hive were successful for many years using this approach. Even Amazon does the same with Lambdas, Glue, etc. 

For Drill, we're talking data integration outside of a data lake. One can argue that known types are still useful. So, perhaps we need a light-weight way, perhaps part of the storage plugin configuration, to resolve ambiguities. That is, tell Drill, "infer the type of most columns, because they are obvious. But, for product version, treat it as a VARCHAR."
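
As one possible shape for that hint, Drill's provided-schema work already points in this direction for file-based tables. A rough sketch, where the table name is hypothetical, only the ambiguous column is declared, and the exact option name and syntax (as of Drill 1.16) may differ slightly:

    ALTER SESSION SET `store.table.use_schema_file` = true;

    CREATE OR REPLACE SCHEMA (
        `version` VARCHAR  -- only the ambiguous column; the rest stay inferred
    )
    FOR TABLE dfs.tmp.`products`;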

Then, there is the separate problem of types which are inherently non-relational: arbitrary JSON structures, say. Here, it is unclear what it means to use SQL to process data which contains a Customer in one row, an Order in the next, and a Blog Post in the third. The key is to understand real use cases to see what is done in actual practice, since the naive conclusion would be that SQL is not a good tool for such data.

For the users out there, can you explain how you'd like this to work for your application?

Thanks,
- Paul

 

Re: Non-Relational Data

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
As a follow-up, to avoid us rehashing the same solutions we've discussed for some time, here are just a few:

* Provide a very light-weight schema "hint" only for those columns that need the hint. In the example, only provide schema for the ambiguous version column. If a column tends to have lots of nulls, tell Drill what type it will be when data actually appears. Arina has a good start on this approach.


* Populate the above automagically as queries run. (Ted has long advocated this one.)

* Populate the schema by converting existing specs, such as a Swagger spec.

* Make the existing UNION and LIST (repeated UNION) types work fully (whatever that means; we'd need type conversion rules for all type combinations; a sketch of today's union type follows after this list.)

* Add true dynamic types. Drill half-way supports vectors that hold Java objects. Make this a first-class feature. Auto-convert conflicting fields. Also needs type conversion rules for all combinations. (Might be more handy for special cases, too heavy-weight for the typical schema-evolution case. Might be an easy way to handle complex structures such as images, complex JSON structures, etc.)

* (Insert your ideas here.)
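
On the UNION/LIST point, Drill already ships an experimental union type that can be switched on per session, which gives a feel for how far it goes today. A rough sketch (the file name is hypothetical; how operators and functions behave on the union values is exactly where the gaps are):

    ALTER SESSION SET `exec.enable_union_type` = true;

    -- version holds numbers in some rows and strings in others;
    -- typeof() shows which type each row carries
    SELECT t.`version`, typeof(t.`version`) AS version_type
    FROM dfs.tmp.`products.json` t;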


Point is, we're not short on ideas. What we need to know is what folks want; then we can pick the idea that works best.


Thanks,
- Paul

 

Re: Non-Relational Data

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Jiang,

Thanks for taking the time to explain your use case. In my experience, the scenario you describe is not unusual: the desire to integrate application data behind a common SQL interface. IMHO, this is where Drill differs from the "data lake guys" such as Impala and Presto. It would be helpful if you could help us get the solution right.


 You have hit upon one of the key challenges in making this model work: the question of how to use SQL to work with a column that has a varying data type. Your product version use case is a classic example.


In the non-relational world, types are messy. That's why we have Python: we can write code that forces the values to one type or another, we can do "if" statements based on types, or we can use "duck typing" to do dynamic method dispatch.


But, you want to use the convenience (and speed) of SQL. We want to hide these details somehow. So we have to be clever.

Hive (then Presto, Impala and even Drill, when used with HMS) solved the problem by requiring a schema stored in HMS. Simple idea, but HMS has become heavyweight. With HMS, we can declare that our version field is a VARCHAR and conversions can be done at read time (schema-on-read.) Problem solved. HMS works best for files. It does not work for the app integration use case: reading data from ad-hoc files, accessing APIs and so on.

Schemas really help with distributed systems. Suppose I say SELECT a + b. SQL can work out that a is an INT, b is a BIGINT, and the result should be a BIGINT. Work can be distributed across many independent nodes and all will make the same type decisions. Data can then be merged and types will agree. Basic stuff. In fact, types are part of the very definition of the relational theory on which SQL is based.


But, if data types vary (no schema, as in Drill without HMS), things get messy. Drill can distribute filters. Suppose I say WHERE version = 10. Drillbit 1 reads the INT-valued version fields, Drillbit 2 reads the VARCHAR valued fields. How do we make sure that both nodes make the same decisions?

Later, when data is merged, should there be a common type? What if I say ORDER BY version? What is the intended result? Sort the INT values before (or after) the VARCHAR ones? Convert VARCHAR to INT (or vice-versa)?

Another factor is speed. Python is slow because it does dynamic type handling on every operation. Java and C are fast because they are statically typed. Similarly, Impala is fast because of static types. Drill tries to be fast by having a fixed vector type for each column. But, Drill also tries to be flexible, which is when things start to get "interesting."

The crude-but-effective solution, without a schema, is to require users to explicitly include the CAST and other statements in every query. It seems this was standard Hadoop practice before HMS came along. But, since you want to use SQL, you presumably want to shield users from this complexity.
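
With the product-version example, that crude approach looks roughly like this (just a sketch; the names and file path are hypothetical):

    -- Force every reader to agree on one type before filtering or sorting
    SELECT t.product_id, CAST(t.`version` AS VARCHAR) AS `version`
    FROM dfs.tmp.`products.json` t
    WHERE CAST(t.`version` AS VARCHAR) = '10'
    ORDER BY CAST(t.`version` AS VARCHAR);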

See? You have nailed the key challenge we must solve to make the app integration idea work.


We can make up lots of things we *could* do. They pretty much fall into three buckets:

* Give the user or app developer a way to declare the type so all nodes do the same thing. Declare the type either per-query (with a CAST) or per-table (using some kind of schema or view; a sketch of the view flavor follows after this list.)

* Discover or impose a uniform set of conversion rules that work in all cases. (Always convert to VARCHAR, say.)

* Embrace dynamic types. Add dynamic types to Drill to make it more Python-like, with "duck typing" to decide, say, that "+" means different things for different type combinations. Accept the resulting performance hit.
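
To make the per-table flavor of the first bucket concrete: one thing that works today is a plain view that bakes in the CASTs, so end users never see them. A sketch, with hypothetical names:

    CREATE OR REPLACE VIEW dfs.tmp.products_v AS
    SELECT t.product_id,
           CAST(t.`version` AS VARCHAR) AS `version`  -- declare the type once
    FROM dfs.tmp.`products.json` t;

    -- Users then query the view with ordinary SQL
    SELECT * FROM dfs.tmp.products_v WHERE `version` = '10';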


Before we brainstorm specific alternatives, it would be very helpful to understand what you actually *want* to do. What was the intent when the type changed? How do you want users to work with such data in a SQL context? How would your ideal user experience look?


Thanks,
- Paul

 
