Posted to dev@drill.apache.org by Daniel Barclay <db...@maprtech.com> on 2015/05/01 03:08:21 UTC

default to waiting for net/full schema? [was: Re: Desired Behavior when a table has both files and folders?]

Should Drill default to not sending changing schema information (that is,
to waiting until it has all the schema information before returning any
through JDBC), and only send changing schemas when the client has somehow
told Drill that it can handle changing schemas (e.g., when the client
registers a handler for schema changes, or in some connection property)?

Then Drill would work "normally" in regular JDBC tools (they won't fail
to show columns that didn't exist in earlier rows or--worse--crash trying
to access columns that no longer exist in later rows), but Drill could
still incrementally return changing schema information to clients that
can handle it.
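
To make the two modes concrete, here is a minimal sketch in Python (not Drill
or JDBC code). The function name deliver, the on_schema_change callback, and
the batch layout (a list of rows, each a dict of column name to value) are all
hypothetical, chosen only for illustration. With no handler registered, the
sketch buffers everything and returns one result with the net schema; with a
handler, it streams batches and reports each schema change.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, object]
Batch = List[Row]  # hypothetical batch layout: a list of row dicts


def deliver(
    batches: Iterable[Batch],
    on_schema_change: Optional[Callable[[List[str]], None]] = None,
) -> Iterator[Tuple[List[str], List[Row]]]:
    """Yield (columns, rows) pairs under one of two delivery modes."""
    if on_schema_change is None:
        # Default mode: wait for all data, then report the net schema once.
        rows = [row for batch in batches for row in batch]
        columns: List[str] = []
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
        # Pad every row so that each one has every column (missing -> None).
        yield columns, [{c: row.get(c) for c in columns} for row in rows]
        return

    # Opt-in mode: stream each batch and notify the client of new columns.
    seen: List[str] = []
    for batch in batches:
        added: List[str] = []
        for row in batch:
            for name in row:
                if name not in seen and name not in added:
                    added.append(name)
        if added:
            seen.extend(added)
            on_schema_change(list(seen))
        yield list(seen), [{c: row.get(c) for c in seen} for row in batch]
```

The default branch has to hold every row before returning anything, so it
trades memory and latency for a stable schema.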

Daniel


Steven Phillips wrote:
> I believe the missing columns are due to a limitation in sqlline itself. For
> this query, Drill doesn't know in advance what columns will be returned. It
> just returns them as they come. When the first batch gets back to sqlline,
> it will assume that whatever columns it receives in that batch are the only
> columns this query will return. And it ignores any new columns that show up.
>
> On Wed, Apr 29, 2015 at 6:20 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
>> You can specify the column names.
>> "select *"  explores the schema by itself.
>>
>> > select * from `data`;
>> +------------+------------+
>> |    dir0    |    col1    |
>> +------------+------------+
>> | null       | 1          |
>> | folder1    | null       |
>> | folder1    | null       |
>> | folder1    | 4          |
>> +------------+------------+
>> 4 rows selected (0.074 seconds)
>> > select dir0,col1,col2 from `data`;
>> +------------+------------+------------+
>> |    dir0    |    col1    |    col2    |
>> +------------+------------+------------+
>> | null       | 1          | null       |
>> | folder1    | null       | 3          |
>> | folder1    | null       | 2          |
>> | folder1    | 4          | null       |
>> +------------+------------+------------+
>> 4 rows selected (0.088 seconds)
>> > select dir0,col1,col2,col3 from `data`;
>> +------------+------------+------------+------------+
>> |    dir0    |    col1    |    col2    |    col3    |
>> +------------+------------+------------+------------+
>> | null       | 1          | null       | null       |
>> | folder1    | null       | 3          | null       |
>> | folder1    | null       | 2          | null       |
>> | folder1    | 4          | null       | null       |
>> +------------+------------+------------+------------+
>> 4 rows selected (0.098 seconds)
>>
>> Thanks,
>> Hao
>>
>> On Wed, Apr 29, 2015 at 5:14 PM, rahul challapalli <
>> challapallirahul@gmail.com> wrote:
>>
>>> What is the desired behavior when I run "select * from data;" on the
>>> below structure?
>>>
>>> data/
>>>    -- file1.json
>>>    -- folder1/
>>>         -- file2.json
>>>
>>> file1.json : {"col1" : 1}
>>> file2.json : {"col2" : 2}
>>>
>>> This is what drill returns :
>>> +------------+------------+
>>> |    dir0    |    col2    |
>>> +------------+------------+
>>> | folder1    | 2          |
>>> | null       | null       |
>>> +------------+------------+
>>>
>>> Looks like drill ignored the columns from the first file.
>>>
>>> - Rahul
>>>
>>
>
>
>
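
The first-batch behavior that Steven describes above can be imitated with a
small Python sketch. This is not sqlline's code; the function name
freeze_columns_on_first_batch and the batch layout (a list of row dicts) are
assumptions made for illustration. The client fixes its column list from the
first batch and silently drops any column that shows up later.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def freeze_columns_on_first_batch(
    batches: Iterable[List[Row]],
) -> Tuple[List[str], List[List[object]]]:
    """Render rows using only the columns seen in the first batch."""
    columns: Optional[List[str]] = None
    table: List[List[object]] = []
    for batch in batches:
        if columns is None:
            # The column list is decided here and never revisited.
            columns = []
            for row in batch:
                for name in row:
                    if name not in columns:
                        columns.append(name)
        for row in batch:
            # Later columns (such as col1 below) are dropped; gaps become None.
            table.append([row.get(name) for name in columns])
    return (columns or []), table


# Hypothetical batches modeled on Rahul's example: file2.json arrives first.
batches = [
    [{"dir0": "folder1", "col2": 2}],
    [{"dir0": None, "col1": 1}],
]
columns, table = freeze_columns_on_first_batch(batches)
print(columns)  # ['dir0', 'col2']
print(table)    # [['folder1', 2], [None, None]]
```

Under these assumed batches the output has the same shape as the table in
Rahul's report: col1 never appears, and the row from file1.json shows only
nulls.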


-- 
Daniel Barclay
MapR Technologies

Re: default to waiting for net/full schema? [was: Re: Desired Behavior when a table has both files and folders?]

Posted by Jacques Nadeau <ja...@apache.org>.

removing all the cross-posting...

Because Drill is a streaming engine, there is no way to know the schema of
all the data in schemaless sources without first reading it.  Holding the
entire dataset in memory (or reading it twice) is too big a penalty.
Enhancements to do sampling would be ideal.  Generally, we either guess or
we know.  Right now we guess (with not very good information).  It seems
like we should definitely improve our guesses.  Knowing is too expensive in
some cases.
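
As a rough illustration of the sampling idea, the Python sketch below guesses
a schema by reading only the first few records of each source. It is not an
existing Drill feature; the function name guess_schema, the sample_size
parameter, and the assumption that each source yields one JSON object per
line are all hypothetical.

```python
import itertools
import json
from typing import Iterable, List


def guess_schema(sources: Iterable[Iterable[str]], sample_size: int = 100) -> List[str]:
    """Guess the column list from the first sample_size records per source."""
    columns: List[str] = []
    for lines in sources:
        # Read only a bounded prefix of each source, not the whole dataset.
        for line in itertools.islice(lines, sample_size):
            record = json.loads(line)  # assumed to be a JSON object
            for name in record:
                if name not in columns:
                    columns.append(name)
    return columns


# Hypothetical sources modeled on file1.json and file2.json from this thread.
print(guess_schema([['{"col1": 1}'], ['{"col2": 2}']]))  # ['col1', 'col2']
```

This remains a guess: any column that first appears after the sampled prefix
is still missed, so a client would still need some way to cope with a later
schema change.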
