You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Vitalii Diravka <vi...@apache.org> on 2019/01/04 11:42:24 UTC

Re: Problem when using files with differents schemas in the same SELECT

Hi Benj,

This is a known issue with a column data type without values, by default it
is INT:OPTIONAL (for your last query) and when the meaningful data is came
that INT:OPTIONAL is treated to the new datatype.
It was discussed very frequently in the different topics. Paul raised this
topic in some mail threads, for instance "Possible way to specify column
types in query" [1].
One of the solutions is to specify the schema before reading the data, it
can be done in the query, schema from the file or with Drill Metastore. But
all mentioned approaches are under development [2], [3].

You can create a Jira ticket for this issue, since it is a good use case
for shoeing this issue. Also possibly some improvements can be done here,
for instance to show missing string values as *null*, but not as empty
string.

each of these SELECT can sometimes return "Error: SYSTEM ERROR:
> NullPointerException".

For the NPE, did you check the stacktrace? Please specify it in Jira as
well.

[1] http://mail-archives.apache.org/mod_mbox/drill-dev/201809.mbox/browser
[2] https://issues.apache.org/jira/browse/DRILL-6552
[3] https://issues.apache.org/jira/browse/DRILL-6835

Kind regards
Vitalii


On Wed, Jan 2, 2019 at 7:59 PM benj.dev <be...@laposte.net.invalid>
wrote:

> Hi,
>
> I have read that in SELECT from multiple sources (SELECT * FROM
> tmp.`myfile*`), the files are treated in random order.
> But I don't understand why the processing of (parquet) files that do not
> have the same columns is not homogeneous.
>
> Example (on Drill 1.14) :
>
> CREATE TABLE tmp2.`mytable1` AS SELECT            1 AS myc1,
>       'col3_1' AS myc3;
> CREATE TABLE tmp2.`mytable2` AS SELECT            2 AS myc1, 'col2_2' AS
> myc2, 'col3_2' AS myc3, 'col4_2' AS myc4;
> CREATE TABLE tmp2.`mytable3` AS SELECT 0 AS myc0, 3 AS myc1, 'col2_3' AS
> myc2;
>
> SELECT * FROM tmp2.`mytable*`;
> | mytable3  | 0           | 3     | col2_3  |
> | mytable2  | 1635023213  | 2     | col2_2  |
> | mytable1  | 1635023213  | 1     |         |
>
> SELECT myc0 FROM tmp2.`mytable*`;
> | 0           |
> | 1818386772  |
> | 1818386772  |
>
> SELECT myc2 FROM tmp2.`mytable*`;
> | col2_3  |
> | col2_2  |
> |         |
>
> SELECT myc0, myc1, myc2, myc3, myc4 FROM tmp2.`mytable*`;
> | 0     | 3     | col2_3  | null    | null    |
> | 0     | 2     | col2_2  | col3_2  | col4_2  |
> | 0     | 1     |         | col3_1  |         |
>
> Please note that :
> - each of these SELECT can sometimes return "Error: SYSTEM ERROR:
> NullPointerException".
> - The undefined columns may have different value in different calls.
> - Another point is that for a given column undefined in some files, this
> one can appear with a null value or empty chain (illustrated by the last
> example).
>   Maybe this is consequent of the (random) order of the SELECT.
>
> I can understand that the processing of different files in the same
> request can be difficult, but
> - Why try to put (random) value on unknown columns and not just put a
> NULL. Put NULL everytime will allow to treat this case
> - An error should appears all the time OR never, not randomly.
>
> Does anyone have an explanation or any trick or is it a well-known
> comportment/bug with already planned developments ?
>
> Thanks for any explanations or digression,
> Regards,
>