You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/04/06 18:20:38 UTC

[GitHub] [druid] gianm commented on issue #11929: Batch ingestion using SQL INSERT

gianm commented on issue #11929:
URL: https://github.com/apache/druid/issues/11929#issuecomment-1090587695

   > SQL says _what_ is to be done, not _how_ to do it. That the time range corresponds to a partition is an implementation detail for us to sort out. The SQL planner can analyze the `WHERE` clause to ensure it contains only `__time` and that the start, end are valid. Who knows, maybe we will support some other kind of replacement at some point: let's not encode the rules in the syntax.
   
   TLDR: I like the idea of using a filter instead of PARTITION syntax, since a partition really can be expressed as a filter, and since it's believable that one day we'd want to support REPLACE on things that are not partitions. But I think we should make the WHERE-on-REPLACE mandatory to prevent users from accidentally replacing the entire table when they meant to replace a time chunk, and we should consider using a different keyword to avoid having two WHEREs on the same query with different semantics.
   
   ---
   
   Expanding on that thought:
   
   Let's write the queries out to see what they look like. I think instead of `__time IN (<start>, <end>)` we'll need a range filter, since the `IN` checks two specific time points rather than a range. So I'll switch to that.
   
   Replace a specific time chunk:
   
   ```
   REPLACE table [insert-style column list]
   WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00
   SELECT ...
   PARTITIONED BY DAY
   ```
   
   I imagine the semantics would be similar to insert-with-overwrite today. So, there's an implicit DELETE here for the entire time range of the WHERE, and the SELECT is expected to generate rows that fit within that time range.
   
   I imagine that it'd be an error if:
   
   - the provided filter is not a time filter
   - the provided filter is not aligned to the provided PARTITIONED BY
   - the SELECT generates rows that don't match the filter
   
   And then, replacing the entire table:
   
   ```
   REPLACE table [insert-style column list]
   SELECT ...
   PARTITIONED BY DAY
   ```
   
   It seems potentially confusing that the query could have two WHEREs in it — one for the REPLACE and one for the SELECT — with different meanings. It makes it easy for users to make the mistake of writing this query:
   
   ```
   REPLACE dst
   SELECT * FROM src
   WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00
   ```
   
   Instead of this one:
   
   ```
   REPLACE dst
   WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00
   SELECT * FROM src
   ```
   
   Or even this one:
   
   ```
   REPLACE dst
   WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00
   SELECT * FROM src
   WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00
   ```
   
   The first query would replace the entire "dst" with one day from "src". The second query would replace one day of "dst" with all of "src". (If "src" contains more than that one day, it would be an error.) The third query would replace one day of "dst" with one day of "src".
   
   I think it's rare that anyone would want to do the first query; probably in most cases they really meant to do the second or third instead. But it's really easy to do it by accidentally omitting the WHERE-of-REPLACE. So I think we should make the first query invalid, either by requiring an explicit `WHERE TRUE` filter or by using a special keyword to indicate "I really want to replace the entire table".
   
   Separately, people might find it weird that there's two WHEREs with different semantics. Maybe we can use a different keyword instead, or in addition to, the first WHERE.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org