Posted to dev@drill.apache.org by Aman Sinha <am...@gmail.com> on 2018/11/06 02:47:28 UTC

Handling schema change in blocking operators

Hi all,
While we continue to enhance the schema provisioning and metastore aspects of
Drill, we should also explore what it means to be truly schema-less, so that
we can better handle {semi, un}structured data and data sitting in DBs
that store JSON documents (e.g. Mongo, MapR-DB).

The blocking operators are the main hurdles to this goal. I wrote up some
thoughts on supporting schema change in the Sort operator in DRILL-6829
<https://issues.apache.org/jira/browse/DRILL-6829>. I would welcome any
feedback on how to go about it going forward.

Thanks,
Aman

Re: Handling schema change in blocking operators

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Boaz,

As noted earlier, it would be wonderful if Drill could handle schema changes on the fly, using only the information in the files as they are read, and with only a few code changes. Alas, such is not the case.

Question: is the goal to fail on schema changes somewhat less often (they would still occur in the cases we discussed)? Or is it to have rock-solid, reliable results? IMHO, the team can produce rock-solid results, with less work, using the schema solution. Everybody understands schemas and will appreciate the ability to use a simple solution rather than working through complex schema-change rules.

Schema-free still exists: it is the reward for folks with clean data sets. Folks with dirty data have to pay the cost of a schema to clean up their mess (in lieu of ETL into Parquet, which is more costly.)

Let's again look at the core problem. Remember the time dimension: files are read in random order on random nodes. This means no two runs of a query will see data in the same order either in the scanner or in downstream operators.

Think through this scenario:

* 20 files, each with 100K rows (2 million rows total, at least two record batches per file)
* 2 nodes with 5 minor fragments each (10 scanners total)
* Files exhibit the schema changes you suggest
* Files are read in random order on random nodes
* xDBC clients demand that the schema delivered on the first row is the same as that delivered on all rows up to the 2 millionth.
* Running the same query (same data) over and over produces the same schema on each run.

Thanks much for the use cases. Let's look at them in this context:

   * Column added

   * Fields added in a JSON document

   * Numeric "enlargement", like INT --> BIGINT, or INT --> DECIMAL, etc.

   * Non-Nullable to Nullable.


All of these appear to have a simple solution if we know the sequence in which the data arrives (a small illustrative sketch follows the list below).

* If a column is added, then if we read the new files first, we'll know the type for the older files that don't have the column.

* If there is numeric enlargement and we read the larger size first, we know to read the older, smaller values at the larger size.

* If we read the non-nullable values first, we know to treat the older, non-nullable values as nullable.
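
As a concrete illustration of the "read the wider type first" idea, here is a
minimal sketch of a type-widening rule in Java. It is purely illustrative --
the SimpleType enum and TypeMerger class are stand-ins, not Drill's actual
MinorType or MaterializedField APIs -- and it only covers the numeric
"enlargement" case:

    // Illustrative only: a tiny widening rule for merging an incoming column
    // type into the type already established for the output schema.
    enum SimpleType { INT, BIGINT, DECIMAL, VARCHAR }

    final class TypeMerger {

      // Type to use for the output column when a new batch arrives with
      // 'incoming' and the schema so far says 'established'.
      static SimpleType merge(SimpleType established, SimpleType incoming) {
        if (established == incoming) {
          return established;
        }
        // Numeric enlargement: keep the wider of the two numeric types.
        if (isNumeric(established) && isNumeric(incoming)) {
          return established.ordinal() >= incoming.ordinal() ? established : incoming;
        }
        // No sensible widening (e.g. INT vs. VARCHAR): fail the query, or
        // fall back to a user-supplied schema hint.
        throw new IllegalStateException(
            "Unresolvable schema change: " + established + " vs. " + incoming);
      }

      private static boolean isNumeric(SimpleType t) {
        return t != SimpleType.VARCHAR;
      }
    }

Note that merge() only works cleanly when the wider type is established first;
if the narrower type arrives first, batches already delivered would have to be
rewritten, which is exactly the time-travel problem below.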

But, let's play the story the other way around. Remember, xDBC clients can't handle a schema change: whatever schema we deliver on the first row we must continue to use to the last row.

* So, suppose we read the old files first. If a column is missing, but the user asked for it, what type is the column? How does the first row know what type will appear once Drill gets to the newer files? Or, do we pick a type and force the column, when it does appear, into that type?

* The same is true of type enlargement: how will Drill know, when it reads and delivers the first rows with a narrow type, that a wider type is coming 100 files from now? Or, if Drill picked the narrow type, should Drill try to force the wider values into that initial narrow type?

There are two choices. First, buffer all the rows before delivering the first so that type changes can be made on the buffered data before it is delivered. Second, know the final type info up front, in the form of a schema (or hint or ...)

Thanks for enumerating the many limitations.


* Schema change for xDBC only works if there is a buffering operator that sees all rows (such as sort), but not with one that sees only some rows at a time (such as a grouped aggregation).

* Schema change would be very hard on key fields, so a query can still fail sometimes. We've not solved the problem, just made it somewhat more obscure.

* Resulting schema order may depend on file read order.

And, I'll throw in one I mentioned before: a random operator may not have the needed context, such as date formats or whether to convert to VARCHAR vs. numeric.

I understand the case for "fail somewhat less often." I am simply suggesting that the team can achieve rock-solid results -- and do so at a lower cost/risk than the partial solution. (I really want to be able to throw away that "Data Engineering" chapter in the Drill book that explains all these limitations.)


Thanks,
- Paul

 

Re: Handling schema change in blocking operators

Posted by Boaz Ben-Zvi <bo...@apache.org>.
  Hi Paul,

(_a_) Having a "schema file" sounds like a contradiction to calling Drill 
"schema free"; maybe we could "sweep it under the mat" by creating a new 
convention for scanners, such that if a scanner has multiple files to 
read (e.g. f1.csv, f2.csv, ...), then if there's some file named 
"MeFirst.csv", it would always be read first! (With some option to 
skip some of its rows, e.g. "MeFirst0.csv" means skip all of its rows.)
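
A minimal sketch of what such a scanner convention might look like, just to
make the idea concrete. The class and the "MeFirst" prefix test are
illustrative only, not an existing Drill scanner API:

    // Illustrative only: order a scan's file list so that a "MeFirst" file,
    // if present, is read first and can establish the schema up front.
    import java.util.Comparator;
    import java.util.List;

    final class MeFirstOrdering {

      static List<String> orderForScan(List<String> files) {
        return files.stream()
            .sorted(Comparator.comparing((String f) -> !f.startsWith("MeFirst"))
                              .thenComparing(Comparator.naturalOrder()))
            .toList();
      }

      public static void main(String[] args) {
        // MeFirst.csv moves to the front; the others keep their natural order.
        System.out.println(orderForScan(List.of("f2.csv", "MeFirst.csv", "f1.csv")));
        // -> [MeFirst.csv, f1.csv, f2.csv]
      }
    }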

(_b_) If the schema (hint) is kept somewhere, could it be updated 
automatically by the executing query? If so, re-running a query that 
failed with a "schema change" may succeed the second time. If there is an 
issue with permissions, maybe each user can keep such a cache in their ~/.drill ...

(_c_) Indeed we can't have a general "schema change" solution; however 
we can focus on the low-hanging fruit, namely "schema evolution". In 
many cases, the change in the schema is "natural", and we could easily 
adapt the blocking operators. Cases like:

    * Column added

    * Fields added in a JSON document

    * Numeric "enlargement", like INT --> BIGINT, or INT --> DECIMAL, etc.

    * Non-Nullable to Nullable.

Further ideas:

- A blocking operator has a notion of the current schema; once the 
schema "evolves", it can either "pause and convert all the old batches", 
or work lazily -- just track the old batches and make changes as needed 
(e.g., work with two sets of generated code, as needed).

- As these changes are rare, we could restrict ourselves to handling only "one 
active change at a time".

- Memory management could be an issue (with "pause and convert"), but 
may be simple if the computation starts using the newer bigger batch 
size (for "lazy").

- We should distinguish between "key" columns and "non-key" columns 
(for Sort / Hash-Join) or "value" columns in the Hash-Agg. One 
possibility for the Hash operators is to have some hash function 
compatibility, like HashFunc( INT 567 ) == HashFunc( BIGINT 567 ), to 
simplify (and avoid rehashing); a small illustrative sketch follows below.
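
A minimal sketch of the hash-compatibility idea: widen integer keys to a
common 64-bit representation before hashing, so the same logical value hashes
identically whether it arrived as an INT or a BIGINT. The mix function here is
illustrative, not Drill's actual hash implementation:

    // Illustrative only: hash integer keys through a common widening so that
    // HashFunc(INT 567) == HashFunc(BIGINT 567) and no rehash is needed when
    // a key column "enlarges" mid-query.
    final class WideningHash {

      static long hash(int key) {
        return hash((long) key);            // widen INT to BIGINT before hashing
      }

      static long hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // simple multiplicative mix
        return h ^ (h >>> 32);
      }

      public static void main(String[] args) {
        System.out.println(hash(567) == hash(567L));  // prints true
      }
    }

A change to a non-numeric type (say INT to VARCHAR) would still require
rehashing, which is one reason key-column changes remain the hard case.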

     Thanks,

          Boaz

Re: Handling schema change in blocking operators

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Aman,

I would completely agree with the analysis -- except for the fact that we can't create a general solution, only a patchwork of incomplete ad-hoc solutions. The question is not whether it would be useful to have a general solution (it would), but rather whether it is technically possible without some help from the user (it is not, IMHO).

I like the scenario presented; it gives us a concrete example. Let's say an IoT device produced files with an evolving schema. A field in a JSON file started as BIGINT, later became DOUBLE, and finally became VARCHAR. What should Drill do? Maybe the values are:
1
1.1
1.33

The change of types might represent the idea that the above are money amounts, and the only way to represent values exactly is with a string (in JSON) and with a DECIMAL in Drill.

Or, maybe the values are:
1
1.1
1.1rev3

This sequence shows that the value is a version string. Early developers thought to use an integer, later they wanted minor versions, and even later they realized they needed a patch value. The correct value type is VARCHAR.

One can also invent a scenario in which the proper type is BIGINT, DOUBLE or even TIMESTAMP.

Since Drill can't know the user's intention, we can invest quite a bit of effort and still not solve the problem.

What is the alternative?

Suppose we simply let the query fail when we see a schema change, but we point the user to a solution:

Query failed: Schema conflict on column `foo`: BIGINT and DOUBLE.
Use a schema file to resolve the ambiguity.
See http://drill.apache.org/docs/schema-file for more information.

Now, the user is in control: we stated what we can and cannot do and gave the user the option to decide on the data type.

This is just one of several use cases: the same mechanism works just as well for specifying CSV types, refining JSON types and so on. A single solution that solves multiple problems.

This approach also solves the problem that the JDBC and ODBC clients can't handle a schema that changes during processing. (The native Drill client can, which is a rather cool feature. xDBC hasn't caught up, so we have to deal with them as they are.)

In fact, Drill could then say: if your data is nice and clean, query it without a schema since the data speaks for itself. If, however, your data is messy (as real-world data tends to be), just provide a schema to explain the intent and Drill will do the right thing.

And, again, if the team tried the schema solution first, you'd be in a much better position to see what additional benefits could be had by trying to guess the type (and solving the time-travel issue.) (This is the lazy approach: do the least amount of work...)

In fact, it may turn out that schema change as an issue disappears once users have a nice, clean, well-understood solution -- the schema file.

Thanks,
- Paul

 

Re: Handling schema change in blocking operators

Posted by Aman Sinha <am...@gmail.com>.
Hi Paul,
Thanks for the feedback! I am completely in favor of doing the schema
discovery and schema hinting. But even on this list in the past we have
discussed other use cases, such as IoT devices, where schema-on-read is
needed (I think it was in the context of the 'death of schema-on-read'
email thread). As I mentioned in my prior email, JSON document databases
don't have a pre-defined schema, and even if one does schema discovery, it
will have to be continuously updated given that these DBs are used in
operational applications where data is streaming in at a fast rate.

I think we should try for a complementary approach - wherever schema
discovery or hinting is feasible, Drill would use it. For other
scenarios, can we make a best effort and not fail the query?

Note that I don't want to backtrack and revise the data types of the rows
already sent to the client. In fact, today, if you have 2 files with
different schemas and the columns are projected as below, the query will
return data to the client in separate batches. This is in fact common
among Drill users doing data exploration (with a LIMIT clause).
(file 1: {a: 10, b: 20.5}   file 2: {a: "cat", b: "dog"} )

0: jdbc:drill:zk=local> select a, b from dfs.`/tmp/table2` ;

+------+-------+
|  a   |   b   |
+------+-------+
| 10   | 20.5  |
| cat  | dog   |
+------+-------+

You mention 'Drill can't predict the future', which is true, and I am saying
we don't need to predict the future. If all operators did what the Scan
readers do, which is to emit a new record batch when they encounter a new
schema, then conceptually it would get us much farther along.

The point is: let's assume the client side is able to handle 2 different
schemas; how can Drill internally handle that in the execution plan? For
the non-blocking operators it means that as soon as the schema changes, the
operator emits the previous record batch and starts a new output batch (a
small illustrative sketch follows below). For the blocking operators, there
are more things to take care of, and I created
DRILL-6829 <https://issues.apache.org/jira/browse/DRILL-6829> to capture
that.
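
A minimal sketch of that non-blocking pattern, just to pin down the mechanics.
The Schema/Row/OutputBatch types are illustrative stand-ins, not Drill's
RecordBatch interfaces, and real operators would work on whole record batches
rather than individual rows:

    // Illustrative only: build output under the current schema; when the
    // incoming schema changes, emit what has accumulated and start a new
    // output batch with the new schema.
    import java.util.ArrayList;
    import java.util.List;

    final class SchemaAwareOutput {

      record Schema(List<String> columns) {}
      record Row(Schema schema, List<Object> values) {}
      record OutputBatch(Schema schema, List<Row> rows) {}

      private Schema currentSchema;
      private List<Row> currentRows = new ArrayList<>();
      private final List<OutputBatch> emitted = new ArrayList<>();

      // Add one incoming row; if its schema differs from the batch in
      // progress, emit the previous batch first and start a new one.
      void add(Row row) {
        if (currentSchema != null && !row.schema().equals(currentSchema)) {
          flush();
        }
        currentSchema = row.schema();
        currentRows.add(row);
      }

      // Close out whatever has accumulated under the current schema.
      void flush() {
        if (!currentRows.isEmpty()) {
          emitted.add(new OutputBatch(currentSchema, currentRows));
          currentRows = new ArrayList<>();
        }
      }

      List<OutputBatch> emittedBatches() {
        flush();
        return emitted;
      }
    }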

Aman

Re: Handling schema change in blocking operators

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Aman,

Thanks much for the write-up. My two cents, FWIW.

As the history of this list has shown, I've fought with the schema change issue multiple times: in sort, in JSON, in the row set loader framework, and in writing the "Data Engineering" chapter in the Learning Drill book.

What I have come to realize is that there is no general solution to the schema change problem. Yes, there are clever things to do in special cases. But the general problem is unsolvable.

Look at the open PR for the projection framework. There is an implementation of a "schema smoother." It tries really hard, but it highlights the inherent limitations of such an effort.

The key reason is that, to do a good job, rows processed now must know the types of rows seen 100 million rows from now. Since Drill does not have a time machine, that is not possible.

The easiest way to visualize this is with a single fragment that reads two files. File A has 100K rows with column C as a Varchar. File B has 100K rows with column C as an Int. There is no sort, so all rows are returned directly to the client as, say, four 50K batches.

The client will encounter a schema with C as Varchar. Later, it will see C as Int. But, since the client already told the JDBC consumer that the type is Varchar, the JDBC client is stuck. It could convert the Int to Varchar behind the scenes.

Now, run the query again. The order in which Drill reads files is random. The second time, the client sees C as an Int. Now, JDBC must convert the later Varchar columns to Int. That works if the Varchar values are numbers, but not if the Ints should have been Varchar.

The general problem, as I put it in the book, is that "Drill can't predict the future", but that is precisely what is needed for a general solution.

However, if the user sets a policy (treat column C as a DECIMAL, even if you read it as an Int or Varchar), then time travel is not necessary.
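
A minimal sketch of what such a per-column policy amounts to, assuming the
declared target type is fixed before any rows are delivered. The ColumnPolicy
class and Target enum are illustrative, not the proposed schema-file API:

    // Illustrative only: coerce whatever the reader actually saw into the
    // type the user declared for the column. A value that cannot be coerced
    // (e.g. "cat" into BIGINT) fails loudly instead of changing the schema.
    import java.math.BigDecimal;

    final class ColumnPolicy {

      enum Target { BIGINT, DECIMAL, VARCHAR }   // user-declared type for column C

      static Object coerce(Object readValue, Target target) {
        return switch (target) {
          case VARCHAR -> readValue.toString();
          case BIGINT  -> readValue instanceof Number n
                            ? n.longValue()
                            : Long.parseLong(readValue.toString());
          case DECIMAL -> new BigDecimal(readValue.toString());
        };
      }

      public static void main(String[] args) {
        // Column C declared as DECIMAL: INT and VARCHAR inputs both coerce.
        System.out.println(coerce(567, Target.DECIMAL));     // 567
        System.out.println(coerce("1.33", Target.DECIMAL));  // 1.33
      }
    }

With the target type fixed up front, the first row delivered to a JDBC/ODBC
client already carries the final type, so no time travel is required.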

My humble suggestion is to focus on the schema effort: give the user a way to define the resolution to the issue that is right for their data. See how that works out for users. Then, with that extra information, go back and see what other features might be useful.

The proposed schema support (at least as hints, preferably as a schema file, full-blown with a metastore) is a much better, easier-to-understand, easier-to-explain solution that is familiar to anyone coming from a DB background.


My suggestion: to understand the challenges and limitations, think through many different scenarios: look at the history of this list for some, see the notes in the Result Set Loader wiki and code for more. Work out how they could be resolved. You may see something I've missed, or you may realize that the problem is just not solvable in general without an up-front schema.

More comments in the JIRA ticket.

Thanks,
- Paul

 
