You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Andries Engelbrecht <ae...@maprtech.com> on 2015/01/21 22:46:04 UTC

Filter out empty arrays in JSON

How do you filter out records with an empty array in drill? 
i.e some records have "url":[]  and some will have an array with data in it. When trying to read records with data in the array drill fails due to records missing any data in the array. Trying a filter with/* where "url":[0] is not null */ fails, also fails if applying url is not null.

Note some of the arrays contains maps, using twitter data as an example below. Some records have an empty array with “hashtags”:[]  and others will look similar to what is listed below.

"entities": {
    "trends": [],
    "symbols": [],
    "urls": [],
    "hashtags": [
      {
        "text": "GoPatriots",
        "indices": [
          83,
          94
        ]
      }
    ],
    "user_mentions": []
  },


Thanks
—Andries

Re: Filter out empty arrays in JSON

Posted by Jason Altekruse <al...@gmail.com>.
Currently this will fail on a repeated list or repeated map. This function
has only been defined for lists of scalars. There is a JIRA open for the
enhancement request, the priority on this should probably be bumped up with
the use cases we have seen. Currently it is marked "Future".

https://issues.apache.org/jira/browse/DRILL-1650

- Jason

On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com> wrote:

> Trying that locally did not work for me (drill 0.7.0):
>
> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> `Downloads/test.json` where repeated_count(`members`) > 0;
> Query failed: Query stopped., Failure while trying to materialize
> incoming schema.  Errors:
>
> Error in expression at index -1.  Error: Missing function
> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> --UNKNOWN EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> bullseye-3:31010 ]
>
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
>
> ​
>
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
>
> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:
>
> > repeated_count('entities.urls') > 0
> >
> > On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > How do you filter out records with an empty array in drill?
> > > i.e some records have "url":[]  and some will have an array with data
> in
> > > it. When trying to read records with data in the array drill fails due
> to
> > > records missing any data in the array. Trying a filter with/* where
> > > "url":[0] is not null */ fails, also fails if applying url is not null.
> > >
> > > Note some of the arrays contains maps, using twitter data as an example
> > > below. Some records have an empty array with “hashtags”:[]  and others
> > will
> > > look similar to what is listed below.
> > >
> > > "entities": {
> > >     "trends": [],
> > >     "symbols": [],
> > >     "urls": [],
> > >     "hashtags": [
> > >       {
> > >         "text": "GoPatriots",
> > >         "indices": [
> > >           83,
> > >           94
> > >         ]
> > >       }
> > >     ],
> > >     "user_mentions": []
> > >   },
> > >
> > >
> > > Thanks
> > > —Andries
> >
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
I get the same error as Chris when trying to use repeated_count.

The issue is that some arrays will have maps/nested maps, and some will be empty.

Thus if we take the twitter example trying to retrieve entities.hashtags.text will fail due to some some entities.hashtags begin empty. Using null detection fails as the array hashtags is empty for some records.

If the repeated_count enhancement that Jason refers to should be able to resolve this, or a simpler alternative of having a null or other function to filter out empty arrays at any level in the JSON tree.

—Andries


On Jan 21, 2015, at 2:01 PM, Aditya <ad...@gmail.com> wrote:

> I believe that this works if the array contains homogeneous primitive
> types. In your example, it appears from the error, the array field 'member'
> contained maps for at least one record.
> 
> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com> wrote:
> 
>> Trying that locally did not work for me (drill 0.7.0):
>> 
>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from `Downloads/test.json` where repeated_count(`members`) > 0;
>> Query failed: Query stopped., Failure while trying to materialize incoming schema.  Errors:
>> 
>> Error in expression at index -1.  Error: Missing function implementation: [repeated_count(MAP-REPEATED)].  Full expression: --UNKNOWN EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on bullseye-3:31010 ]
>> 
>> Error: exception while executing query: Failure while executing query. (state=,code=0)
>> 
>> ​
>> 
>> Chris Matta
>> cmatta@mapr.com
>> 215-701-3146
>> 
>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:
>> 
>>> repeated_count('entities.urls') > 0
>>> 
>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> How do you filter out records with an empty array in drill?
>>>> i.e some records have "url":[]  and some will have an array with data in
>>>> it. When trying to read records with data in the array drill fails due
>>> to
>>>> records missing any data in the array. Trying a filter with/* where
>>>> "url":[0] is not null */ fails, also fails if applying url is not null.
>>>> 
>>>> Note some of the arrays contains maps, using twitter data as an example
>>>> below. Some records have an empty array with “hashtags”:[]  and others
>>> will
>>>> look similar to what is listed below.
>>>> 
>>>> "entities": {
>>>>    "trends": [],
>>>>    "symbols": [],
>>>>    "urls": [],
>>>>    "hashtags": [
>>>>      {
>>>>        "text": "GoPatriots",
>>>>        "indices": [
>>>>          83,
>>>>          94
>>>>        ]
>>>>      }
>>>>    ],
>>>>    "user_mentions": []
>>>>  },
>>>> 
>>>> 
>>>> Thanks
>>>> —Andries
>>> 
>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Steven Phillips <sp...@maprtech.com>.
>From the link, it doesn't look there is a convention.

On Wed, Jan 21, 2015 at 6:00 PM, Hao Zhu <hz...@maprtech.com> wrote:

> DRILL-2055 is filed.
>
> Thanks,
> Hao
>
> On Wed, Jan 21, 2015 at 5:37 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Steven, are you referring to the fact that he has two map keys with the
> > same name?
> >
> > If so, is that technically invalid json?
> >
> > I generally agree with this post [1].
> >
> >
> >
> http://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object#answer-23195243
> >
> > If others agree, I think it would be appropriate for Hao to file a JIRA
> to
> > make sure we follow and check this convention.
> >
> > J
> >
> > On Wed, Jan 21, 2015 at 5:07 PM, Steven Phillips <sphillips@maprtech.com
> >
> > wrote:
> >
> > > Hao,
> > >
> > > It looks to me like your json file is actually invalid, and it seems
> like
> > > Drill should be throwing an exception.
> > >
> > > On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >
> > > > I figured out the differences after getting the json file from
> Andries.
> > > >
> > > > My json file is like:
> > > > {“entities”:{xxx},“entities”:{yyy}... }
> > > >
> > > > Andries' json file is like:
> > > > {“entities”:{xxx}}
> > > > {“entities”:{yyy}}
> > > > ...
> > > >
> > > > So basically Andries' json file is contains multiple json files.
> > > >
> > > > Fortunately, we can use the same SQL to get the same results:
> > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | []         |
> > > > | [{"text":"GoPatriots"}] |
> > > > | [{"text":"aaa"}] |
> > > > | []         |
> > > > | [{"text":"bbb"}] |
> > > > +------------+
> > > > 5 rows selected (0.136 seconds)
> > > > 0: jdbc:drill:> with tmp as
> > > > . . . . . . . > (
> > > > . . . . . . . > select flatten(t.entities.hashtags) as c from
> > > > dfs.tmp.`z2.json` t
> > > > . . . . . . . > )
> > > > . . . . . . . > select tmp.c.text from tmp;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | GoPatriots |
> > > > | aaa        |
> > > > | bbb        |
> > > > +------------+
> > > > 3 rows selected (0.115 seconds)
> > > >
> > > > Thanks,
> > > > Hao
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> > > > aengelbrecht@maprtech.com> wrote:
> > > >
> > > > > In my case it returns the empty records when flatten is not used.
> > > > >
> > > > > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
> > hashtags
> > > > > from `twitter.json` t limit 10;
> > > > > +------------+
> > > > > |  hashtags  |
> > > > > +------------+
> > > > > | []         |
> > > > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > > > | []         |
> > > > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > > > | []         |
> > > > > | []         |
> > > > > | []         |
> > > > > | []         |
> > > > > | []         |
> > > > > | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > > > +------------+
> > > > >
> > > > > On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > > >
> > > > > > Actually not due to flatten, if you directly query the file, it
> > will
> > > > only
> > > > > > show the non-null values.
> > > > > >
> > > > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json`
> t;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > > > > > +------------+
> > > > > > 1 row selected (0.123 seconds)
> > > > > >
> > > > > > Thanks,
> > > > > > Hao
> > > > > >
> > > > > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > > > > > aengelbrecht@maprtech.com> wrote:
> > > > > >
> > > > > >> Very interesting, flatten seems to bypass empty records. Not
> sure
> > if
> > > > > that
> > > > > >> is an ideal result for all use cases, but certainly usable in
> this
> > > > case.
> > > > > >>
> > > > > >> Thanks
> > > > > >> —Andries
> > > > > >>
> > > > > >>
> > > > > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > > > >>
> > > > > >>> I can also fetch non-null values for attached json file which
> > > > contains
> > > > > 6
> > > > > >> "entities", 3 of them are null, 3 of them have "text" value.
> > > > > >>>
> > > > > >>> Could you share your complete "twitter.json"?
> > > > > >>>
> > > > > >>> 0: jdbc:drill:> with tmp as
> > > > > >>> . . . . . . . > (
> > > > > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > > > > >> dfs.tmp.`z.json` t
> > > > > >>> . . . . . . . > )
> > > > > >>> . . . . . . . > select tmp.c.text from tmp;
> > > > > >>> +------------+
> > > > > >>> |   EXPR$0   |
> > > > > >>> +------------+
> > > > > >>> | GoPatriots |
> > > > > >>> | aaa        |
> > > > > >>> | bbb        |
> > > > > >>> +------------+
> > > > > >>> 3 rows selected (0.122 seconds)
> > > > > >>>
> > > > > >>> Thanks,
> > > > > >>> Hao
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> > > > > >> aengelbrecht@maprtech.com> wrote:
> > > > > >>> The sample data I posted only has 1 element, but some records
> > have
> > > > > >> multiple elements in them.
> > > > > >>>
> > > > > >>> Interestingly enough though
> > > > > >>> select t.entities.hashtags[0].`text` from `twitter.json` t
> limit
> > > 10;
> > > > > >>>
> > > > > >>> Produces
> > > > > >>> +------------+
> > > > > >>> |   EXPR$0   |
> > > > > >>> +------------+
> > > > > >>> | null       |
> > > > > >>> | SportsNews |
> > > > > >>> | null       |
> > > > > >>> | SportsNews |
> > > > > >>> | null       |
> > > > > >>> | null       |
> > > > > >>> | null       |
> > > > > >>> | null       |
> > > > > >>> | null       |
> > > > > >>> | CARvsSEA   |
> > > > > >>> +——————+
> > > > > >>>
> > > > > >>>
> > > > > >>> And
> > > > > >>>
> > > > > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> > > > > >>>
> > > > > >>> +------------+
> > > > > >>> |   EXPR$0   |
> > > > > >>> +------------+
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"indices":[]} |
> > > > > >>> | {"text":"CARvsSEA","indices":[90,99]} |
> > > > > >>> +——————+
> > > > > >>>
> > > > > >>> Strange part is that there is no indices map in the hashtags
> > array,
> > > > so
> > > > > >> no idea why it shows up when pointing to the first lament in an
> > > empty
> > > > > array.
> > > > > >>>
> > > > > >>>
> > > > > >>> —Andries
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> > > > > >>>
> > > > > >>>> I just noticed that the result of "hashtags" is just an array
> > with
> > > > > >> only 1
> > > > > >>>> element.
> > > > > >>>> So take your example:
> > > > > >>>> [root@maprdemo tmp]# cat d.json
> > > > > >>>> {
> > > > > >>>> "entities": {
> > > > > >>>>  "trends": [],
> > > > > >>>>  "symbols": [],
> > > > > >>>>  "urls": [],
> > > > > >>>>  "hashtags": [],
> > > > > >>>>  "user_mentions": []
> > > > > >>>> },
> > > > > >>>> "entities": {
> > > > > >>>>  "trends": [1,2,3],
> > > > > >>>>  "symbols": [4,5,6],
> > > > > >>>>  "urls": [7,8,9],
> > > > > >>>>  "hashtags": [
> > > > > >>>>    {
> > > > > >>>>      "text": "GoPatriots",
> > > > > >>>>      "indices": []
> > > > > >>>>    }
> > > > > >>>>  ],
> > > > > >>>>  "user_mentions": []
> > > > > >>>> }
> > > > > >>>> }
> > > > > >>>>
> > > > > >>>> Now we can do this to achieve the results:
> > > > > >>>> 0: jdbc:drill:> select t.entities.hashtags from
> dfs.tmp.`d.json`
> > > t ;
> > > > > >>>> +------------+
> > > > > >>>> |   EXPR$0   |
> > > > > >>>> +------------+
> > > > > >>>> | [{"text":"GoPatriots"}] |
> > > > > >>>> +------------+
> > > > > >>>> 1 row selected (0.09 seconds)
> > > > > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> > > > > >> dfs.tmp.`d.json` t ;
> > > > > >>>> +------------+
> > > > > >>>> |   EXPR$0   |
> > > > > >>>> +------------+
> > > > > >>>> | GoPatriots |
> > > > > >>>> +------------+
> > > > > >>>> 1 row selected (0.108 seconds)
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>> Hao
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > > > > >>>> aengelbrecht@maprtech.com> wrote:
> > > > > >>>>
> > > > > >>>>> When I run the query on a larger dataset it actually show the
> > > empty
> > > > > >>>>> records.
> > > > > >>>>>
> > > > > >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> > > > > >>>>>
> > > > > >>>>> +------------+
> > > > > >>>>> |   EXPR$0   |
> > > > > >>>>> +------------+
> > > > > >>>>> | []         |
> > > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > > > >>>>> | []         |
> > > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > > > >>>>> | []         |
> > > > > >>>>> | []         |
> > > > > >>>>> | []         |
> > > > > >>>>> | []         |
> > > > > >>>>> | []         |
> > > > > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > > > >>>>> +------------+
> > > > > >>>>> 10 rows selected (2.899 seconds)
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> However having the output as maps is not very useful, unless
> i
> > > can
> > > > > >> filter
> > > > > >>>>> out the records with empty arrays and then drill deeper into
> > the
> > > > ones
> > > > > >> with
> > > > > >>>>> data in the arrays.
> > > > > >>>>>
> > > > > >>>>> BTW: Hao I would have expiated your query to return both
> rows,
> > > one
> > > > > >> with an
> > > > > >>>>> empty array as above and the other with the array data.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> —Andries
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com>
> > wrote:
> > > > > >>>>>
> > > > > >>>>>> I am not sure if below is expected behavior.
> > > > > >>>>>> If we only select "hashtags", and it will return only 1 row
> > > > ignoring
> > > > > >> the
> > > > > >>>>>> null value.
> > > > > >>>>>> However then if we try to get "hashtags.text", it
> > fails...which
> > > > > >> means it
> > > > > >>>>> is
> > > > > >>>>>> still trying to read the NULL value.
> > > > > >>>>>> I am thinking it may confuse the SQL developers.
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> > dfs.tmp.`d.json`
> > > > t ;
> > > > > >>>>>> +------------+
> > > > > >>>>>> |   EXPR$0   |
> > > > > >>>>>> +------------+
> > > > > >>>>>> | [{"text":"GoPatriots"}] |
> > > > > >>>>>> +------------+
> > > > > >>>>>> 1 row selected (0.109 seconds)
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> > > > > >> dfs.tmp.`d.json` t ;
> > > > > >>>>>>
> > > > > >>>>>> Query failed: Query failed: Failure while running fragment.,
> > > > > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector
> cannot
> > be
> > > > > >> cast to
> > > > > >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > > > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > > > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Error: exception while executing query: Failure while
> > executing
> > > > > >> query.
> > > > > >>>>>> (state=,code=0)
> > > > > >>>>>>
> > > > > >>>>>> Thanks,
> > > > > >>>>>> Hao
> > > > > >>>>>>
> > > > > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > > > > >>>>>> aengelbrecht@maprtech.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Now try on hashtags with the following:
> > > > > >>>>>>>
> > > > > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> > > > > >> `/twitter.json` t
> > > > > >>>>>>> where t.entities.hashtags is not null limit 10;
> > > > > >>>>>>>
> > > > > >>>>>>> Query failed: Query failed: Failure while running
> fragment.,
> > > > > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector
> cannot
> > > be
> > > > > >> cast to
> > > > > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > > > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > > > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Error: exception while executing query: Failure while
> > executing
> > > > > >> query.
> > > > > >>>>>>> (state=,code=0)
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> {
> > > > > >>>>>>> "entities": {
> > > > > >>>>>>> "trends": [],
> > > > > >>>>>>> "symbols": [],
> > > > > >>>>>>> "urls": [],
> > > > > >>>>>>> "hashtags": [],
> > > > > >>>>>>> "user_mentions": []
> > > > > >>>>>>> },
> > > > > >>>>>>> "entities": {
> > > > > >>>>>>> "trends": [1,2,3],
> > > > > >>>>>>> "symbols": [4,5,6],
> > > > > >>>>>>> "urls": [7,8,9],
> > > > > >>>>>>> "hashtags": [
> > > > > >>>>>>>   {
> > > > > >>>>>>>     "text": "GoPatriots",
> > > > > >>>>>>>     "indices": []
> > > > > >>>>>>>   }
> > > > > >>>>>>> ],
> > > > > >>>>>>> "user_mentions": []
> > > > > >>>>>>> }
> > > > > >>>>>>> }
> > > > > >>>>>>>
> > > > > >>>>>>> The issue seems to be that if some records have arrays with
> > > maps
> > > > in
> > > > > >> them
> > > > > >>>>>>> and others are empty.
> > > > > >>>>>>>
> > > > > >>>>>>> —Andries
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> > > wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Seems it works for below json file:
> > > > > >>>>>>>> {
> > > > > >>>>>>>> "entities": {
> > > > > >>>>>>>> "trends": [],
> > > > > >>>>>>>> "symbols": [],
> > > > > >>>>>>>> "urls": [],
> > > > > >>>>>>>> "hashtags": [
> > > > > >>>>>>>>   {
> > > > > >>>>>>>>     "text": "GoPatriots",
> > > > > >>>>>>>>     "indices": [
> > > > > >>>>>>>>       83,
> > > > > >>>>>>>>       94
> > > > > >>>>>>>>     ]
> > > > > >>>>>>>>   }
> > > > > >>>>>>>> ],
> > > > > >>>>>>>> "user_mentions": []
> > > > > >>>>>>>> },
> > > > > >>>>>>>> "entities": {
> > > > > >>>>>>>> "trends": [1,2,3],
> > > > > >>>>>>>> "symbols": [4,5,6],
> > > > > >>>>>>>> "urls": [7,8,9],
> > > > > >>>>>>>> "hashtags": [
> > > > > >>>>>>>>   {
> > > > > >>>>>>>>     "text": "GoPatriots",
> > > > > >>>>>>>>     "indices": []
> > > > > >>>>>>>>   }
> > > > > >>>>>>>> ],
> > > > > >>>>>>>> "user_mentions": []
> > > > > >>>>>>>> }
> > > > > >>>>>>>> }
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from
> dfs.tmp.`a.json`
> > > as
> > > > t
> > > > > >> where
> > > > > >>>>>>>> t.entities.urls is not null;
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> |   EXPR$0   |
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> | [7,8,9]    |
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> 1 row selected (0.139 seconds)
> > > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from
> dfs.tmp.`a.json`
> > > as
> > > > t
> > > > > >> where
> > > > > >>>>>>>> t.entities.urls is null;
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> |   EXPR$0   |
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> +------------+
> > > > > >>>>>>>> No rows selected (0.158 seconds)
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks,
> > > > > >>>>>>>> Hao
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> > > > adityakishore@gmail.com>
> > > > > >>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> I believe that this works if the array contains
> homogeneous
> > > > > >> primitive
> > > > > >>>>>>>>> types. In your example, it appears from the error, the
> > array
> > > > > field
> > > > > >>>>>>> 'member'
> > > > > >>>>>>>>> contained maps for at least one record.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> > > > > >> cmatta@mapr.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
> > from
> > > > > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) >
> 0;
> > > > > >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> > > > > materialize
> > > > > >>>>>>>>> incoming schema.  Errors:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Error in expression at index -1.  Error: Missing
> function
> > > > > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> > > > expression:
> > > > > >>>>>>> --UNKNOWN
> > > > > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > > > > >>>>>>> bullseye-3:31010 ]
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Error: exception while executing query: Failure while
> > > > executing
> > > > > >>>>> query.
> > > > > >>>>>>>>> (state=,code=0)
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> ​
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Chris Matta
> > > > > >>>>>>>>>> cmatta@mapr.com
> > > > > >>>>>>>>>> 215-701-3146
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> > > > > adityakishore@gmail.com
> > > > > >>>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> repeated_count('entities.urls') > 0
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > > > > >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> How do you filter out records with an empty array in
> > > drill?
> > > > > >>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> > > array
> > > > > >> with
> > > > > >>>>> data
> > > > > >>>>>>>>> in
> > > > > >>>>>>>>>>>> it. When trying to read records with data in the array
> > > drill
> > > > > >> fails
> > > > > >>>>>>> due
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>> records missing any data in the array. Trying a filter
> > > > with/*
> > > > > >> where
> > > > > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
> > url
> > > > is
> > > > > >> not
> > > > > >>>>>>>>> null.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter
> > data
> > > as
> > > > > an
> > > > > >>>>>>>>> example
> > > > > >>>>>>>>>>>> below. Some records have an empty array with
> > “hashtags”:[]
> > > > > and
> > > > > >>>>>>> others
> > > > > >>>>>>>>>>> will
> > > > > >>>>>>>>>>>> look similar to what is listed below.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> "entities": {
> > > > > >>>>>>>>>>>> "trends": [],
> > > > > >>>>>>>>>>>> "symbols": [],
> > > > > >>>>>>>>>>>> "urls": [],
> > > > > >>>>>>>>>>>> "hashtags": [
> > > > > >>>>>>>>>>>>   {
> > > > > >>>>>>>>>>>>     "text": "GoPatriots",
> > > > > >>>>>>>>>>>>     "indices": [
> > > > > >>>>>>>>>>>>       83,
> > > > > >>>>>>>>>>>>       94
> > > > > >>>>>>>>>>>>     ]
> > > > > >>>>>>>>>>>>   }
> > > > > >>>>>>>>>>>> ],
> > > > > >>>>>>>>>>>> "user_mentions": []
> > > > > >>>>>>>>>>>> },
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks
> > > > > >>>>>>>>>>>> —Andries
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> <z.json>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
DRILL-2055 is filed.

Thanks,
Hao

On Wed, Jan 21, 2015 at 5:37 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Steven, are you referring to the fact that he has two map keys with the
> same name?
>
> If so, is that technically invalid json?
>
> I generally agree with this post [1].
>
>
> http://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object#answer-23195243
>
> If others agree, I think it would be appropriate for Hao to file a JIRA to
> make sure we follow and check this convention.
>
> J
>
> On Wed, Jan 21, 2015 at 5:07 PM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
> > Hao,
> >
> > It looks to me like your json file is actually invalid, and it seems like
> > Drill should be throwing an exception.
> >
> > On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >
> > > I figured out the differences after getting the json file from Andries.
> > >
> > > My json file is like:
> > > {“entities”:{xxx},“entities”:{yyy}... }
> > >
> > > Andries' json file is like:
> > > {“entities”:{xxx}}
> > > {“entities”:{yyy}}
> > > ...
> > >
> > > So basically Andries' json file is contains multiple json files.
> > >
> > > Fortunately, we can use the same SQL to get the same results:
> > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | []         |
> > > | [{"text":"GoPatriots"}] |
> > > | [{"text":"aaa"}] |
> > > | []         |
> > > | [{"text":"bbb"}] |
> > > +------------+
> > > 5 rows selected (0.136 seconds)
> > > 0: jdbc:drill:> with tmp as
> > > . . . . . . . > (
> > > . . . . . . . > select flatten(t.entities.hashtags) as c from
> > > dfs.tmp.`z2.json` t
> > > . . . . . . . > )
> > > . . . . . . . > select tmp.c.text from tmp;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | GoPatriots |
> > > | aaa        |
> > > | bbb        |
> > > +------------+
> > > 3 rows selected (0.115 seconds)
> > >
> > > Thanks,
> > > Hao
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > > > In my case it returns the empty records when flatten is not used.
> > > >
> > > > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
> hashtags
> > > > from `twitter.json` t limit 10;
> > > > +------------+
> > > > |  hashtags  |
> > > > +------------+
> > > > | []         |
> > > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > > | []         |
> > > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > > | []         |
> > > > | []         |
> > > > | []         |
> > > > | []         |
> > > > | []         |
> > > > | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > > +------------+
> > > >
> > > > On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > >
> > > > > Actually not due to flatten, if you directly query the file, it
> will
> > > only
> > > > > show the non-null values.
> > > > >
> > > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > > > > +------------+
> > > > > |   EXPR$0   |
> > > > > +------------+
> > > > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > > > > +------------+
> > > > > 1 row selected (0.123 seconds)
> > > > >
> > > > > Thanks,
> > > > > Hao
> > > > >
> > > > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > > > > aengelbrecht@maprtech.com> wrote:
> > > > >
> > > > >> Very interesting, flatten seems to bypass empty records. Not sure
> if
> > > > that
> > > > >> is an ideal result for all use cases, but certainly usable in this
> > > case.
> > > > >>
> > > > >> Thanks
> > > > >> —Andries
> > > > >>
> > > > >>
> > > > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > > >>
> > > > >>> I can also fetch non-null values for attached json file which
> > > contains
> > > > 6
> > > > >> "entities", 3 of them are null, 3 of them have "text" value.
> > > > >>>
> > > > >>> Could you share your complete "twitter.json"?
> > > > >>>
> > > > >>> 0: jdbc:drill:> with tmp as
> > > > >>> . . . . . . . > (
> > > > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > > > >> dfs.tmp.`z.json` t
> > > > >>> . . . . . . . > )
> > > > >>> . . . . . . . > select tmp.c.text from tmp;
> > > > >>> +------------+
> > > > >>> |   EXPR$0   |
> > > > >>> +------------+
> > > > >>> | GoPatriots |
> > > > >>> | aaa        |
> > > > >>> | bbb        |
> > > > >>> +------------+
> > > > >>> 3 rows selected (0.122 seconds)
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Hao
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> > > > >> aengelbrecht@maprtech.com> wrote:
> > > > >>> The sample data I posted only has 1 element, but some records
> have
> > > > >> multiple elements in them.
> > > > >>>
> > > > >>> Interestingly enough though
> > > > >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
> > 10;
> > > > >>>
> > > > >>> Produces
> > > > >>> +------------+
> > > > >>> |   EXPR$0   |
> > > > >>> +------------+
> > > > >>> | null       |
> > > > >>> | SportsNews |
> > > > >>> | null       |
> > > > >>> | SportsNews |
> > > > >>> | null       |
> > > > >>> | null       |
> > > > >>> | null       |
> > > > >>> | null       |
> > > > >>> | null       |
> > > > >>> | CARvsSEA   |
> > > > >>> +——————+
> > > > >>>
> > > > >>>
> > > > >>> And
> > > > >>>
> > > > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> > > > >>>
> > > > >>> +------------+
> > > > >>> |   EXPR$0   |
> > > > >>> +------------+
> > > > >>> | {"indices":[]} |
> > > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"indices":[]} |
> > > > >>> | {"text":"CARvsSEA","indices":[90,99]} |
> > > > >>> +——————+
> > > > >>>
> > > > >>> Strange part is that there is no indices map in the hashtags
> array,
> > > so
> > > > >> no idea why it shows up when pointing to the first lament in an
> > empty
> > > > array.
> > > > >>>
> > > > >>>
> > > > >>> —Andries
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > > >>>
> > > > >>>> I just noticed that the result of "hashtags" is just an array
> with
> > > > >> only 1
> > > > >>>> element.
> > > > >>>> So take your example:
> > > > >>>> [root@maprdemo tmp]# cat d.json
> > > > >>>> {
> > > > >>>> "entities": {
> > > > >>>>  "trends": [],
> > > > >>>>  "symbols": [],
> > > > >>>>  "urls": [],
> > > > >>>>  "hashtags": [],
> > > > >>>>  "user_mentions": []
> > > > >>>> },
> > > > >>>> "entities": {
> > > > >>>>  "trends": [1,2,3],
> > > > >>>>  "symbols": [4,5,6],
> > > > >>>>  "urls": [7,8,9],
> > > > >>>>  "hashtags": [
> > > > >>>>    {
> > > > >>>>      "text": "GoPatriots",
> > > > >>>>      "indices": []
> > > > >>>>    }
> > > > >>>>  ],
> > > > >>>>  "user_mentions": []
> > > > >>>> }
> > > > >>>> }
> > > > >>>>
> > > > >>>> Now we can do this to achieve the results:
> > > > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> > t ;
> > > > >>>> +------------+
> > > > >>>> |   EXPR$0   |
> > > > >>>> +------------+
> > > > >>>> | [{"text":"GoPatriots"}] |
> > > > >>>> +------------+
> > > > >>>> 1 row selected (0.09 seconds)
> > > > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> > > > >> dfs.tmp.`d.json` t ;
> > > > >>>> +------------+
> > > > >>>> |   EXPR$0   |
> > > > >>>> +------------+
> > > > >>>> | GoPatriots |
> > > > >>>> +------------+
> > > > >>>> 1 row selected (0.108 seconds)
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Hao
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > > > >>>> aengelbrecht@maprtech.com> wrote:
> > > > >>>>
> > > > >>>>> When I run the query on a larger dataset it actually show the
> > empty
> > > > >>>>> records.
> > > > >>>>>
> > > > >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> > > > >>>>>
> > > > >>>>> +------------+
> > > > >>>>> |   EXPR$0   |
> > > > >>>>> +------------+
> > > > >>>>> | []         |
> > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > > >>>>> | []         |
> > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > > >>>>> | []         |
> > > > >>>>> | []         |
> > > > >>>>> | []         |
> > > > >>>>> | []         |
> > > > >>>>> | []         |
> > > > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > > >>>>> +------------+
> > > > >>>>> 10 rows selected (2.899 seconds)
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> However having the output as maps is not very useful, unless i
> > can
> > > > >> filter
> > > > >>>>> out the records with empty arrays and then drill deeper into
> the
> > > ones
> > > > >> with
> > > > >>>>> data in the arrays.
> > > > >>>>>
> > > > >>>>> BTW: Hao I would have expiated your query to return both rows,
> > one
> > > > >> with an
> > > > >>>>> empty array as above and the other with the array data.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> —Andries
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> > > > >>>>>
> > > > >>>>>> I am not sure if below is expected behavior.
> > > > >>>>>> If we only select "hashtags", and it will return only 1 row
> > > ignoring
> > > > >> the
> > > > >>>>>> null value.
> > > > >>>>>> However then if we try to get "hashtags.text", it
> fails...which
> > > > >> means it
> > > > >>>>> is
> > > > >>>>>> still trying to read the NULL value.
> > > > >>>>>> I am thinking it may confuse the SQL developers.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> dfs.tmp.`d.json`
> > > t ;
> > > > >>>>>> +------------+
> > > > >>>>>> |   EXPR$0   |
> > > > >>>>>> +------------+
> > > > >>>>>> | [{"text":"GoPatriots"}] |
> > > > >>>>>> +------------+
> > > > >>>>>> 1 row selected (0.109 seconds)
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> > > > >> dfs.tmp.`d.json` t ;
> > > > >>>>>>
> > > > >>>>>> Query failed: Query failed: Failure while running fragment.,
> > > > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> be
> > > > >> cast to
> > > > >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Error: exception while executing query: Failure while
> executing
> > > > >> query.
> > > > >>>>>> (state=,code=0)
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>> Hao
> > > > >>>>>>
> > > > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > > > >>>>>> aengelbrecht@maprtech.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> Now try on hashtags with the following:
> > > > >>>>>>>
> > > > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> > > > >> `/twitter.json` t
> > > > >>>>>>> where t.entities.hashtags is not null limit 10;
> > > > >>>>>>>
> > > > >>>>>>> Query failed: Query failed: Failure while running fragment.,
> > > > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> > be
> > > > >> cast to
> > > > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Error: exception while executing query: Failure while
> executing
> > > > >> query.
> > > > >>>>>>> (state=,code=0)
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> {
> > > > >>>>>>> "entities": {
> > > > >>>>>>> "trends": [],
> > > > >>>>>>> "symbols": [],
> > > > >>>>>>> "urls": [],
> > > > >>>>>>> "hashtags": [],
> > > > >>>>>>> "user_mentions": []
> > > > >>>>>>> },
> > > > >>>>>>> "entities": {
> > > > >>>>>>> "trends": [1,2,3],
> > > > >>>>>>> "symbols": [4,5,6],
> > > > >>>>>>> "urls": [7,8,9],
> > > > >>>>>>> "hashtags": [
> > > > >>>>>>>   {
> > > > >>>>>>>     "text": "GoPatriots",
> > > > >>>>>>>     "indices": []
> > > > >>>>>>>   }
> > > > >>>>>>> ],
> > > > >>>>>>> "user_mentions": []
> > > > >>>>>>> }
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> The issue seems to be that if some records have arrays with
> > maps
> > > in
> > > > >> them
> > > > >>>>>>> and others are empty.
> > > > >>>>>>>
> > > > >>>>>>> —Andries
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> > wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Seems it works for below json file:
> > > > >>>>>>>> {
> > > > >>>>>>>> "entities": {
> > > > >>>>>>>> "trends": [],
> > > > >>>>>>>> "symbols": [],
> > > > >>>>>>>> "urls": [],
> > > > >>>>>>>> "hashtags": [
> > > > >>>>>>>>   {
> > > > >>>>>>>>     "text": "GoPatriots",
> > > > >>>>>>>>     "indices": [
> > > > >>>>>>>>       83,
> > > > >>>>>>>>       94
> > > > >>>>>>>>     ]
> > > > >>>>>>>>   }
> > > > >>>>>>>> ],
> > > > >>>>>>>> "user_mentions": []
> > > > >>>>>>>> },
> > > > >>>>>>>> "entities": {
> > > > >>>>>>>> "trends": [1,2,3],
> > > > >>>>>>>> "symbols": [4,5,6],
> > > > >>>>>>>> "urls": [7,8,9],
> > > > >>>>>>>> "hashtags": [
> > > > >>>>>>>>   {
> > > > >>>>>>>>     "text": "GoPatriots",
> > > > >>>>>>>>     "indices": []
> > > > >>>>>>>>   }
> > > > >>>>>>>> ],
> > > > >>>>>>>> "user_mentions": []
> > > > >>>>>>>> }
> > > > >>>>>>>> }
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> > as
> > > t
> > > > >> where
> > > > >>>>>>>> t.entities.urls is not null;
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> |   EXPR$0   |
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> | [7,8,9]    |
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> 1 row selected (0.139 seconds)
> > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> > as
> > > t
> > > > >> where
> > > > >>>>>>>> t.entities.urls is null;
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> |   EXPR$0   |
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> +------------+
> > > > >>>>>>>> No rows selected (0.158 seconds)
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks,
> > > > >>>>>>>> Hao
> > > > >>>>>>>>
> > > > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> > > adityakishore@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> I believe that this works if the array contains homogeneous
> > > > >> primitive
> > > > >>>>>>>>> types. In your example, it appears from the error, the
> array
> > > > field
> > > > >>>>>>> 'member'
> > > > >>>>>>>>> contained maps for at least one record.
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> > > > >> cmatta@mapr.com>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
> from
> > > > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> > > > >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> > > > materialize
> > > > >>>>>>>>> incoming schema.  Errors:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Error in expression at index -1.  Error: Missing function
> > > > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> > > expression:
> > > > >>>>>>> --UNKNOWN
> > > > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > > > >>>>>>> bullseye-3:31010 ]
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Error: exception while executing query: Failure while
> > > executing
> > > > >>>>> query.
> > > > >>>>>>>>> (state=,code=0)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> ​
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Chris Matta
> > > > >>>>>>>>>> cmatta@mapr.com
> > > > >>>>>>>>>> 215-701-3146
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> > > > adityakishore@gmail.com
> > > > >>>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> repeated_count('entities.urls') > 0
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > > > >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> How do you filter out records with an empty array in
> > drill?
> > > > >>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> > array
> > > > >> with
> > > > >>>>> data
> > > > >>>>>>>>> in
> > > > >>>>>>>>>>>> it. When trying to read records with data in the array
> > drill
> > > > >> fails
> > > > >>>>>>> due
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>> records missing any data in the array. Trying a filter
> > > with/*
> > > > >> where
> > > > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
> url
> > > is
> > > > >> not
> > > > >>>>>>>>> null.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter
> data
> > as
> > > > an
> > > > >>>>>>>>> example
> > > > >>>>>>>>>>>> below. Some records have an empty array with
> “hashtags”:[]
> > > > and
> > > > >>>>>>> others
> > > > >>>>>>>>>>> will
> > > > >>>>>>>>>>>> look similar to what is listed below.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> "entities": {
> > > > >>>>>>>>>>>> "trends": [],
> > > > >>>>>>>>>>>> "symbols": [],
> > > > >>>>>>>>>>>> "urls": [],
> > > > >>>>>>>>>>>> "hashtags": [
> > > > >>>>>>>>>>>>   {
> > > > >>>>>>>>>>>>     "text": "GoPatriots",
> > > > >>>>>>>>>>>>     "indices": [
> > > > >>>>>>>>>>>>       83,
> > > > >>>>>>>>>>>>       94
> > > > >>>>>>>>>>>>     ]
> > > > >>>>>>>>>>>>   }
> > > > >>>>>>>>>>>> ],
> > > > >>>>>>>>>>>> "user_mentions": []
> > > > >>>>>>>>>>>> },
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks
> > > > >>>>>>>>>>>> —Andries
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>>
> > > > >>> <z.json>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>

Re: Filter out empty arrays in JSON

Posted by Jacques Nadeau <ja...@apache.org>.
Steven, are you referring to the fact that he has two map keys with the
same name?

If so, is that technically invalid json?

I generally agree with this post [1].

http://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object#answer-23195243

If others agree, I think it would be appropriate for Hao to file a JIRA to
make sure we follow and check this convention.

J

On Wed, Jan 21, 2015 at 5:07 PM, Steven Phillips <sp...@maprtech.com>
wrote:

> Hao,
>
> It looks to me like your json file is actually invalid, and it seems like
> Drill should be throwing an exception.
>
> On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > I figured out the differences after getting the json file from Andries.
> >
> > My json file is like:
> > {“entities”:{xxx},“entities”:{yyy}... }
> >
> > Andries' json file is like:
> > {“entities”:{xxx}}
> > {“entities”:{yyy}}
> > ...
> >
> > So basically Andries' json file is contains multiple json files.
> >
> > Fortunately, we can use the same SQL to get the same results:
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | []         |
> > | [{"text":"GoPatriots"}] |
> > | [{"text":"aaa"}] |
> > | []         |
> > | [{"text":"bbb"}] |
> > +------------+
> > 5 rows selected (0.136 seconds)
> > 0: jdbc:drill:> with tmp as
> > . . . . . . . > (
> > . . . . . . . > select flatten(t.entities.hashtags) as c from
> > dfs.tmp.`z2.json` t
> > . . . . . . . > )
> > . . . . . . . > select tmp.c.text from tmp;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | GoPatriots |
> > | aaa        |
> > | bbb        |
> > +------------+
> > 3 rows selected (0.115 seconds)
> >
> > Thanks,
> > Hao
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > In my case it returns the empty records when flatten is not used.
> > >
> > > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> > > from `twitter.json` t limit 10;
> > > +------------+
> > > |  hashtags  |
> > > +------------+
> > > | []         |
> > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > | []         |
> > > | [{"text":"SportsNews","indices":[0,11]}] |
> > > | []         |
> > > | []         |
> > > | []         |
> > > | []         |
> > > | []         |
> > > | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > +------------+
> > >
> > > On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >
> > > > Actually not due to flatten, if you directly query the file, it will
> > only
> > > > show the non-null values.
> > > >
> > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > > > +------------+
> > > > 1 row selected (0.123 seconds)
> > > >
> > > > Thanks,
> > > > Hao
> > > >
> > > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > > > aengelbrecht@maprtech.com> wrote:
> > > >
> > > >> Very interesting, flatten seems to bypass empty records. Not sure if
> > > that
> > > >> is an ideal result for all use cases, but certainly usable in this
> > case.
> > > >>
> > > >> Thanks
> > > >> —Andries
> > > >>
> > > >>
> > > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > >>
> > > >>> I can also fetch non-null values for attached json file which
> > contains
> > > 6
> > > >> "entities", 3 of them are null, 3 of them have "text" value.
> > > >>>
> > > >>> Could you share your complete "twitter.json"?
> > > >>>
> > > >>> 0: jdbc:drill:> with tmp as
> > > >>> . . . . . . . > (
> > > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > > >> dfs.tmp.`z.json` t
> > > >>> . . . . . . . > )
> > > >>> . . . . . . . > select tmp.c.text from tmp;
> > > >>> +------------+
> > > >>> |   EXPR$0   |
> > > >>> +------------+
> > > >>> | GoPatriots |
> > > >>> | aaa        |
> > > >>> | bbb        |
> > > >>> +------------+
> > > >>> 3 rows selected (0.122 seconds)
> > > >>>
> > > >>> Thanks,
> > > >>> Hao
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> > > >> aengelbrecht@maprtech.com> wrote:
> > > >>> The sample data I posted only has 1 element, but some records have
> > > >> multiple elements in them.
> > > >>>
> > > >>> Interestingly enough though
> > > >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
> 10;
> > > >>>
> > > >>> Produces
> > > >>> +------------+
> > > >>> |   EXPR$0   |
> > > >>> +------------+
> > > >>> | null       |
> > > >>> | SportsNews |
> > > >>> | null       |
> > > >>> | SportsNews |
> > > >>> | null       |
> > > >>> | null       |
> > > >>> | null       |
> > > >>> | null       |
> > > >>> | null       |
> > > >>> | CARvsSEA   |
> > > >>> +——————+
> > > >>>
> > > >>>
> > > >>> And
> > > >>>
> > > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> > > >>>
> > > >>> +------------+
> > > >>> |   EXPR$0   |
> > > >>> +------------+
> > > >>> | {"indices":[]} |
> > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"text":"SportsNews","indices":[90,99]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"indices":[]} |
> > > >>> | {"text":"CARvsSEA","indices":[90,99]} |
> > > >>> +——————+
> > > >>>
> > > >>> Strange part is that there is no indices map in the hashtags array,
> > so
> > > >> no idea why it shows up when pointing to the first lament in an
> empty
> > > array.
> > > >>>
> > > >>>
> > > >>> —Andries
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > >>>
> > > >>>> I just noticed that the result of "hashtags" is just an array with
> > > >> only 1
> > > >>>> element.
> > > >>>> So take your example:
> > > >>>> [root@maprdemo tmp]# cat d.json
> > > >>>> {
> > > >>>> "entities": {
> > > >>>>  "trends": [],
> > > >>>>  "symbols": [],
> > > >>>>  "urls": [],
> > > >>>>  "hashtags": [],
> > > >>>>  "user_mentions": []
> > > >>>> },
> > > >>>> "entities": {
> > > >>>>  "trends": [1,2,3],
> > > >>>>  "symbols": [4,5,6],
> > > >>>>  "urls": [7,8,9],
> > > >>>>  "hashtags": [
> > > >>>>    {
> > > >>>>      "text": "GoPatriots",
> > > >>>>      "indices": []
> > > >>>>    }
> > > >>>>  ],
> > > >>>>  "user_mentions": []
> > > >>>> }
> > > >>>> }
> > > >>>>
> > > >>>> Now we can do this to achieve the results:
> > > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> t ;
> > > >>>> +------------+
> > > >>>> |   EXPR$0   |
> > > >>>> +------------+
> > > >>>> | [{"text":"GoPatriots"}] |
> > > >>>> +------------+
> > > >>>> 1 row selected (0.09 seconds)
> > > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> > > >> dfs.tmp.`d.json` t ;
> > > >>>> +------------+
> > > >>>> |   EXPR$0   |
> > > >>>> +------------+
> > > >>>> | GoPatriots |
> > > >>>> +------------+
> > > >>>> 1 row selected (0.108 seconds)
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Hao
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > > >>>> aengelbrecht@maprtech.com> wrote:
> > > >>>>
> > > >>>>> When I run the query on a larger dataset it actually show the
> empty
> > > >>>>> records.
> > > >>>>>
> > > >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> > > >>>>>
> > > >>>>> +------------+
> > > >>>>> |   EXPR$0   |
> > > >>>>> +------------+
> > > >>>>> | []         |
> > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > >>>>> | []         |
> > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > > >>>>> | []         |
> > > >>>>> | []         |
> > > >>>>> | []         |
> > > >>>>> | []         |
> > > >>>>> | []         |
> > > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > > >>>>> +------------+
> > > >>>>> 10 rows selected (2.899 seconds)
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> However having the output as maps is not very useful, unless i
> can
> > > >> filter
> > > >>>>> out the records with empty arrays and then drill deeper into the
> > ones
> > > >> with
> > > >>>>> data in the arrays.
> > > >>>>>
> > > >>>>> BTW: Hao I would have expiated your query to return both rows,
> one
> > > >> with an
> > > >>>>> empty array as above and the other with the array data.
> > > >>>>>
> > > >>>>>
> > > >>>>> —Andries
> > > >>>>>
> > > >>>>>
> > > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > > >>>>>
> > > >>>>>> I am not sure if below is expected behavior.
> > > >>>>>> If we only select "hashtags", and it will return only 1 row
> > ignoring
> > > >> the
> > > >>>>>> null value.
> > > >>>>>> However then if we try to get "hashtags.text", it fails...which
> > > >> means it
> > > >>>>> is
> > > >>>>>> still trying to read the NULL value.
> > > >>>>>> I am thinking it may confuse the SQL developers.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> > t ;
> > > >>>>>> +------------+
> > > >>>>>> |   EXPR$0   |
> > > >>>>>> +------------+
> > > >>>>>> | [{"text":"GoPatriots"}] |
> > > >>>>>> +------------+
> > > >>>>>> 1 row selected (0.109 seconds)
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> > > >> dfs.tmp.`d.json` t ;
> > > >>>>>>
> > > >>>>>> Query failed: Query failed: Failure while running fragment.,
> > > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> > > >> cast to
> > > >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Error: exception while executing query: Failure while executing
> > > >> query.
> > > >>>>>> (state=,code=0)
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Hao
> > > >>>>>>
> > > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > > >>>>>> aengelbrecht@maprtech.com> wrote:
> > > >>>>>>
> > > >>>>>>> Now try on hashtags with the following:
> > > >>>>>>>
> > > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> > > >> `/twitter.json` t
> > > >>>>>>> where t.entities.hashtags is not null limit 10;
> > > >>>>>>>
> > > >>>>>>> Query failed: Query failed: Failure while running fragment.,
> > > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> be
> > > >> cast to
> > > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Error: exception while executing query: Failure while executing
> > > >> query.
> > > >>>>>>> (state=,code=0)
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> {
> > > >>>>>>> "entities": {
> > > >>>>>>> "trends": [],
> > > >>>>>>> "symbols": [],
> > > >>>>>>> "urls": [],
> > > >>>>>>> "hashtags": [],
> > > >>>>>>> "user_mentions": []
> > > >>>>>>> },
> > > >>>>>>> "entities": {
> > > >>>>>>> "trends": [1,2,3],
> > > >>>>>>> "symbols": [4,5,6],
> > > >>>>>>> "urls": [7,8,9],
> > > >>>>>>> "hashtags": [
> > > >>>>>>>   {
> > > >>>>>>>     "text": "GoPatriots",
> > > >>>>>>>     "indices": []
> > > >>>>>>>   }
> > > >>>>>>> ],
> > > >>>>>>> "user_mentions": []
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> The issue seems to be that if some records have arrays with
> maps
> > in
> > > >> them
> > > >>>>>>> and others are empty.
> > > >>>>>>>
> > > >>>>>>> —Andries
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> > > >>>>>>>
> > > >>>>>>>> Seems it works for below json file:
> > > >>>>>>>> {
> > > >>>>>>>> "entities": {
> > > >>>>>>>> "trends": [],
> > > >>>>>>>> "symbols": [],
> > > >>>>>>>> "urls": [],
> > > >>>>>>>> "hashtags": [
> > > >>>>>>>>   {
> > > >>>>>>>>     "text": "GoPatriots",
> > > >>>>>>>>     "indices": [
> > > >>>>>>>>       83,
> > > >>>>>>>>       94
> > > >>>>>>>>     ]
> > > >>>>>>>>   }
> > > >>>>>>>> ],
> > > >>>>>>>> "user_mentions": []
> > > >>>>>>>> },
> > > >>>>>>>> "entities": {
> > > >>>>>>>> "trends": [1,2,3],
> > > >>>>>>>> "symbols": [4,5,6],
> > > >>>>>>>> "urls": [7,8,9],
> > > >>>>>>>> "hashtags": [
> > > >>>>>>>>   {
> > > >>>>>>>>     "text": "GoPatriots",
> > > >>>>>>>>     "indices": []
> > > >>>>>>>>   }
> > > >>>>>>>> ],
> > > >>>>>>>> "user_mentions": []
> > > >>>>>>>> }
> > > >>>>>>>> }
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as
> > t
> > > >> where
> > > >>>>>>>> t.entities.urls is not null;
> > > >>>>>>>> +------------+
> > > >>>>>>>> |   EXPR$0   |
> > > >>>>>>>> +------------+
> > > >>>>>>>> | [7,8,9]    |
> > > >>>>>>>> +------------+
> > > >>>>>>>> 1 row selected (0.139 seconds)
> > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as
> > t
> > > >> where
> > > >>>>>>>> t.entities.urls is null;
> > > >>>>>>>> +------------+
> > > >>>>>>>> |   EXPR$0   |
> > > >>>>>>>> +------------+
> > > >>>>>>>> +------------+
> > > >>>>>>>> No rows selected (0.158 seconds)
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Hao
> > > >>>>>>>>
> > > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> > adityakishore@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> I believe that this works if the array contains homogeneous
> > > >> primitive
> > > >>>>>>>>> types. In your example, it appears from the error, the array
> > > field
> > > >>>>>>> 'member'
> > > >>>>>>>>> contained maps for at least one record.
> > > >>>>>>>>>
> > > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> > > >> cmatta@mapr.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > > >>>>>>>>>>
> > > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> > > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> > > >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> > > materialize
> > > >>>>>>>>> incoming schema.  Errors:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Error in expression at index -1.  Error: Missing function
> > > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> > expression:
> > > >>>>>>> --UNKNOWN
> > > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > > >>>>>>> bullseye-3:31010 ]
> > > >>>>>>>>>>
> > > >>>>>>>>>> Error: exception while executing query: Failure while
> > executing
> > > >>>>> query.
> > > >>>>>>>>> (state=,code=0)
> > > >>>>>>>>>>
> > > >>>>>>>>>> ​
> > > >>>>>>>>>>
> > > >>>>>>>>>> Chris Matta
> > > >>>>>>>>>> cmatta@mapr.com
> > > >>>>>>>>>> 215-701-3146
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> > > adityakishore@gmail.com
> > > >>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> repeated_count('entities.urls') > 0
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > > >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> How do you filter out records with an empty array in
> drill?
> > > >>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> array
> > > >> with
> > > >>>>> data
> > > >>>>>>>>> in
> > > >>>>>>>>>>>> it. When trying to read records with data in the array
> drill
> > > >> fails
> > > >>>>>>> due
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>> records missing any data in the array. Trying a filter
> > with/*
> > > >> where
> > > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url
> > is
> > > >> not
> > > >>>>>>>>> null.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
> as
> > > an
> > > >>>>>>>>> example
> > > >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> > > and
> > > >>>>>>> others
> > > >>>>>>>>>>> will
> > > >>>>>>>>>>>> look similar to what is listed below.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> "entities": {
> > > >>>>>>>>>>>> "trends": [],
> > > >>>>>>>>>>>> "symbols": [],
> > > >>>>>>>>>>>> "urls": [],
> > > >>>>>>>>>>>> "hashtags": [
> > > >>>>>>>>>>>>   {
> > > >>>>>>>>>>>>     "text": "GoPatriots",
> > > >>>>>>>>>>>>     "indices": [
> > > >>>>>>>>>>>>       83,
> > > >>>>>>>>>>>>       94
> > > >>>>>>>>>>>>     ]
> > > >>>>>>>>>>>>   }
> > > >>>>>>>>>>>> ],
> > > >>>>>>>>>>>> "user_mentions": []
> > > >>>>>>>>>>>> },
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks
> > > >>>>>>>>>>>> —Andries
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>> <z.json>
> > > >>
> > > >>
> > >
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Filter out empty arrays in JSON

Posted by Steven Phillips <sp...@maprtech.com>.
Hao,

It looks to me like your json file is actually invalid, and it seems like
Drill should be throwing an exception.

On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:

> I figured out the differences after getting the json file from Andries.
>
> My json file is like:
> {“entities”:{xxx},“entities”:{yyy}... }
>
> Andries' json file is like:
> {“entities”:{xxx}}
> {“entities”:{yyy}}
> ...
>
> So basically Andries' json file is contains multiple json files.
>
> Fortunately, we can use the same SQL to get the same results:
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> +------------+
> |   EXPR$0   |
> +------------+
> | []         |
> | [{"text":"GoPatriots"}] |
> | [{"text":"aaa"}] |
> | []         |
> | [{"text":"bbb"}] |
> +------------+
> 5 rows selected (0.136 seconds)
> 0: jdbc:drill:> with tmp as
> . . . . . . . > (
> . . . . . . . > select flatten(t.entities.hashtags) as c from
> dfs.tmp.`z2.json` t
> . . . . . . . > )
> . . . . . . . > select tmp.c.text from tmp;
> +------------+
> |   EXPR$0   |
> +------------+
> | GoPatriots |
> | aaa        |
> | bbb        |
> +------------+
> 3 rows selected (0.115 seconds)
>
> Thanks,
> Hao
>
>
>
>
>
>
> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > In my case it returns the empty records when flatten is not used.
> >
> > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> > from `twitter.json` t limit 10;
> > +------------+
> > |  hashtags  |
> > +------------+
> > | []         |
> > | [{"text":"SportsNews","indices":[0,11]}] |
> > | []         |
> > | [{"text":"SportsNews","indices":[0,11]}] |
> > | []         |
> > | []         |
> > | []         |
> > | []         |
> > | []         |
> > | [{"text":"CARvsSEA","indices":[36,45]}] |
> > +------------+
> >
> > On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >
> > > Actually not due to flatten, if you directly query the file, it will
> only
> > > show the non-null values.
> > >
> > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > > +------------+
> > > 1 row selected (0.123 seconds)
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > >> Very interesting, flatten seems to bypass empty records. Not sure if
> > that
> > >> is an ideal result for all use cases, but certainly usable in this
> case.
> > >>
> > >> Thanks
> > >> —Andries
> > >>
> > >>
> > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>
> > >>> I can also fetch non-null values for attached json file which
> contains
> > 6
> > >> "entities", 3 of them are null, 3 of them have "text" value.
> > >>>
> > >>> Could you share your complete "twitter.json"?
> > >>>
> > >>> 0: jdbc:drill:> with tmp as
> > >>> . . . . . . . > (
> > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > >> dfs.tmp.`z.json` t
> > >>> . . . . . . . > )
> > >>> . . . . . . . > select tmp.c.text from tmp;
> > >>> +------------+
> > >>> |   EXPR$0   |
> > >>> +------------+
> > >>> | GoPatriots |
> > >>> | aaa        |
> > >>> | bbb        |
> > >>> +------------+
> > >>> 3 rows selected (0.122 seconds)
> > >>>
> > >>> Thanks,
> > >>> Hao
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> > >> aengelbrecht@maprtech.com> wrote:
> > >>> The sample data I posted only has 1 element, but some records have
> > >> multiple elements in them.
> > >>>
> > >>> Interestingly enough though
> > >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> > >>>
> > >>> Produces
> > >>> +------------+
> > >>> |   EXPR$0   |
> > >>> +------------+
> > >>> | null       |
> > >>> | SportsNews |
> > >>> | null       |
> > >>> | SportsNews |
> > >>> | null       |
> > >>> | null       |
> > >>> | null       |
> > >>> | null       |
> > >>> | null       |
> > >>> | CARvsSEA   |
> > >>> +——————+
> > >>>
> > >>>
> > >>> And
> > >>>
> > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> > >>>
> > >>> +------------+
> > >>> |   EXPR$0   |
> > >>> +------------+
> > >>> | {"indices":[]} |
> > >>> | {"text":"SportsNews","indices":[90,99]} |
> > >>> | {"indices":[]} |
> > >>> | {"text":"SportsNews","indices":[90,99]} |
> > >>> | {"indices":[]} |
> > >>> | {"indices":[]} |
> > >>> | {"indices":[]} |
> > >>> | {"indices":[]} |
> > >>> | {"indices":[]} |
> > >>> | {"text":"CARvsSEA","indices":[90,99]} |
> > >>> +——————+
> > >>>
> > >>> Strange part is that there is no indices map in the hashtags array,
> so
> > >> no idea why it shows up when pointing to the first lament in an empty
> > array.
> > >>>
> > >>>
> > >>> —Andries
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>
> > >>>> I just noticed that the result of "hashtags" is just an array with
> > >> only 1
> > >>>> element.
> > >>>> So take your example:
> > >>>> [root@maprdemo tmp]# cat d.json
> > >>>> {
> > >>>> "entities": {
> > >>>>  "trends": [],
> > >>>>  "symbols": [],
> > >>>>  "urls": [],
> > >>>>  "hashtags": [],
> > >>>>  "user_mentions": []
> > >>>> },
> > >>>> "entities": {
> > >>>>  "trends": [1,2,3],
> > >>>>  "symbols": [4,5,6],
> > >>>>  "urls": [7,8,9],
> > >>>>  "hashtags": [
> > >>>>    {
> > >>>>      "text": "GoPatriots",
> > >>>>      "indices": []
> > >>>>    }
> > >>>>  ],
> > >>>>  "user_mentions": []
> > >>>> }
> > >>>> }
> > >>>>
> > >>>> Now we can do this to achieve the results:
> > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > >>>> +------------+
> > >>>> |   EXPR$0   |
> > >>>> +------------+
> > >>>> | [{"text":"GoPatriots"}] |
> > >>>> +------------+
> > >>>> 1 row selected (0.09 seconds)
> > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> > >> dfs.tmp.`d.json` t ;
> > >>>> +------------+
> > >>>> |   EXPR$0   |
> > >>>> +------------+
> > >>>> | GoPatriots |
> > >>>> +------------+
> > >>>> 1 row selected (0.108 seconds)
> > >>>>
> > >>>> Thanks,
> > >>>> Hao
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > >>>> aengelbrecht@maprtech.com> wrote:
> > >>>>
> > >>>>> When I run the query on a larger dataset it actually show the empty
> > >>>>> records.
> > >>>>>
> > >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> > >>>>>
> > >>>>> +------------+
> > >>>>> |   EXPR$0   |
> > >>>>> +------------+
> > >>>>> | []         |
> > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>>>> | []         |
> > >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>>>> | []         |
> > >>>>> | []         |
> > >>>>> | []         |
> > >>>>> | []         |
> > >>>>> | []         |
> > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > >>>>> +------------+
> > >>>>> 10 rows selected (2.899 seconds)
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> However having the output as maps is not very useful, unless i can
> > >> filter
> > >>>>> out the records with empty arrays and then drill deeper into the
> ones
> > >> with
> > >>>>> data in the arrays.
> > >>>>>
> > >>>>> BTW: Hao I would have expiated your query to return both rows, one
> > >> with an
> > >>>>> empty array as above and the other with the array data.
> > >>>>>
> > >>>>>
> > >>>>> —Andries
> > >>>>>
> > >>>>>
> > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>>
> > >>>>>> I am not sure if below is expected behavior.
> > >>>>>> If we only select "hashtags", and it will return only 1 row
> ignoring
> > >> the
> > >>>>>> null value.
> > >>>>>> However then if we try to get "hashtags.text", it fails...which
> > >> means it
> > >>>>> is
> > >>>>>> still trying to read the NULL value.
> > >>>>>> I am thinking it may confuse the SQL developers.
> > >>>>>>
> > >>>>>>
> > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> t ;
> > >>>>>> +------------+
> > >>>>>> |   EXPR$0   |
> > >>>>>> +------------+
> > >>>>>> | [{"text":"GoPatriots"}] |
> > >>>>>> +------------+
> > >>>>>> 1 row selected (0.109 seconds)
> > >>>>>>
> > >>>>>>
> > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> > >> dfs.tmp.`d.json` t ;
> > >>>>>>
> > >>>>>> Query failed: Query failed: Failure while running fragment.,
> > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> > >> cast to
> > >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>>>>>
> > >>>>>>
> > >>>>>> Error: exception while executing query: Failure while executing
> > >> query.
> > >>>>>> (state=,code=0)
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Hao
> > >>>>>>
> > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > >>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>
> > >>>>>>> Now try on hashtags with the following:
> > >>>>>>>
> > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> > >> `/twitter.json` t
> > >>>>>>> where t.entities.hashtags is not null limit 10;
> > >>>>>>>
> > >>>>>>> Query failed: Query failed: Failure while running fragment.,
> > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> > >> cast to
> > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Error: exception while executing query: Failure while executing
> > >> query.
> > >>>>>>> (state=,code=0)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> "entities": {
> > >>>>>>> "trends": [],
> > >>>>>>> "symbols": [],
> > >>>>>>> "urls": [],
> > >>>>>>> "hashtags": [],
> > >>>>>>> "user_mentions": []
> > >>>>>>> },
> > >>>>>>> "entities": {
> > >>>>>>> "trends": [1,2,3],
> > >>>>>>> "symbols": [4,5,6],
> > >>>>>>> "urls": [7,8,9],
> > >>>>>>> "hashtags": [
> > >>>>>>>   {
> > >>>>>>>     "text": "GoPatriots",
> > >>>>>>>     "indices": []
> > >>>>>>>   }
> > >>>>>>> ],
> > >>>>>>> "user_mentions": []
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> The issue seems to be that if some records have arrays with maps
> in
> > >> them
> > >>>>>>> and others are empty.
> > >>>>>>>
> > >>>>>>> —Andries
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>>>>
> > >>>>>>>> Seems it works for below json file:
> > >>>>>>>> {
> > >>>>>>>> "entities": {
> > >>>>>>>> "trends": [],
> > >>>>>>>> "symbols": [],
> > >>>>>>>> "urls": [],
> > >>>>>>>> "hashtags": [
> > >>>>>>>>   {
> > >>>>>>>>     "text": "GoPatriots",
> > >>>>>>>>     "indices": [
> > >>>>>>>>       83,
> > >>>>>>>>       94
> > >>>>>>>>     ]
> > >>>>>>>>   }
> > >>>>>>>> ],
> > >>>>>>>> "user_mentions": []
> > >>>>>>>> },
> > >>>>>>>> "entities": {
> > >>>>>>>> "trends": [1,2,3],
> > >>>>>>>> "symbols": [4,5,6],
> > >>>>>>>> "urls": [7,8,9],
> > >>>>>>>> "hashtags": [
> > >>>>>>>>   {
> > >>>>>>>>     "text": "GoPatriots",
> > >>>>>>>>     "indices": []
> > >>>>>>>>   }
> > >>>>>>>> ],
> > >>>>>>>> "user_mentions": []
> > >>>>>>>> }
> > >>>>>>>> }
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as
> t
> > >> where
> > >>>>>>>> t.entities.urls is not null;
> > >>>>>>>> +------------+
> > >>>>>>>> |   EXPR$0   |
> > >>>>>>>> +------------+
> > >>>>>>>> | [7,8,9]    |
> > >>>>>>>> +------------+
> > >>>>>>>> 1 row selected (0.139 seconds)
> > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as
> t
> > >> where
> > >>>>>>>> t.entities.urls is null;
> > >>>>>>>> +------------+
> > >>>>>>>> |   EXPR$0   |
> > >>>>>>>> +------------+
> > >>>>>>>> +------------+
> > >>>>>>>> No rows selected (0.158 seconds)
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Hao
> > >>>>>>>>
> > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> adityakishore@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> I believe that this works if the array contains homogeneous
> > >> primitive
> > >>>>>>>>> types. In your example, it appears from the error, the array
> > field
> > >>>>>>> 'member'
> > >>>>>>>>> contained maps for at least one record.
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> > >> cmatta@mapr.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > >>>>>>>>>>
> > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> > >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> > materialize
> > >>>>>>>>> incoming schema.  Errors:
> > >>>>>>>>>>
> > >>>>>>>>>> Error in expression at index -1.  Error: Missing function
> > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> expression:
> > >>>>>>> --UNKNOWN
> > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > >>>>>>> bullseye-3:31010 ]
> > >>>>>>>>>>
> > >>>>>>>>>> Error: exception while executing query: Failure while
> executing
> > >>>>> query.
> > >>>>>>>>> (state=,code=0)
> > >>>>>>>>>>
> > >>>>>>>>>> ​
> > >>>>>>>>>>
> > >>>>>>>>>> Chris Matta
> > >>>>>>>>>> cmatta@mapr.com
> > >>>>>>>>>> 215-701-3146
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> > adityakishore@gmail.com
> > >>>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> repeated_count('entities.urls') > 0
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> How do you filter out records with an empty array in drill?
> > >>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
> > >> with
> > >>>>> data
> > >>>>>>>>> in
> > >>>>>>>>>>>> it. When trying to read records with data in the array drill
> > >> fails
> > >>>>>>> due
> > >>>>>>>>>>> to
> > >>>>>>>>>>>> records missing any data in the array. Trying a filter
> with/*
> > >> where
> > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url
> is
> > >> not
> > >>>>>>>>> null.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
> > an
> > >>>>>>>>> example
> > >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> > and
> > >>>>>>> others
> > >>>>>>>>>>> will
> > >>>>>>>>>>>> look similar to what is listed below.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> "entities": {
> > >>>>>>>>>>>> "trends": [],
> > >>>>>>>>>>>> "symbols": [],
> > >>>>>>>>>>>> "urls": [],
> > >>>>>>>>>>>> "hashtags": [
> > >>>>>>>>>>>>   {
> > >>>>>>>>>>>>     "text": "GoPatriots",
> > >>>>>>>>>>>>     "indices": [
> > >>>>>>>>>>>>       83,
> > >>>>>>>>>>>>       94
> > >>>>>>>>>>>>     ]
> > >>>>>>>>>>>>   }
> > >>>>>>>>>>>> ],
> > >>>>>>>>>>>> "user_mentions": []
> > >>>>>>>>>>>> },
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks
> > >>>>>>>>>>>> —Andries
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >>> <z.json>
> > >>
> > >>
> >
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Good catch, seem to work when pointing to a specific map in the array.

Strange thing is that I have actually been using that to filter out retweets, by using using a specific map to retweeted_status.id is not null.

Good find though.

Also flatten on hashtags works to filter out the empty arrays, but you may not always want to use flatten.

When pointing to a specific map Drill seems to consistently produce a null if the map does not exist,which is very handy.

--Andries 



> On Feb 5, 2015, at 7:36 PM, Christopher Matta <cm...@mapr.com> wrote:
> 
> You missed something:
> 
> select t.entities.hashtags from dfs.twitter./nflt where
> t.entities.hashtags[0].*text* is not null limit 10;
> 
> Add the .text to the end of the where.
> ​
> 
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
> 
> On Thu, Feb 5, 2015 at 10:05 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Chris
>> 
>> Thanks, but I tried it before without success.
>> 
>> I still get this on Drill 0.7.
>> 
>> select t.entities.hashtags from dfs.twitter.`/nfl` t where
>> t.entities.hashtags[0] is not null limit 10;
>> 
>> Error: exception while executing query: Failure while executing query.
>> (state=,code=0)
>> Query failed: Query failed: Failure while running fragment., Failure while
>> trying to materialize incoming schema.  Errors:
>> 
>> Error in expression at index 0.  Error: Missing function implementation:
>> [isnotnull(MAP-REQUIRED)].  Full expression: null.. [
>> 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
>> [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
>> 
>> 
>> Mine seems finicky about even using the first element in the array
>> 
>> select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;
>> 
>> Query failed: Query failed: Failure while running fragment., index: 4,
>> length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on
>> drilldemo:31010 ]
>> [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
>> 
>> 
>> However flatten seems to work. (as long as you don’t have thousands of
>> small json files)
>> 
>> select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;
>> 
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | {"text":"PantherPride","indices":["12","25"]} |
>> | {"text":"KeepPounding","indices":["48","61"]} |
>> | {"text":"vegas","indices":["107","113"]} |
>> | {"text":"nfl","indices":["23","27"]} |
>> | {"text":"Cowboys","indices":["7","15"]} |
>> | {"text":"Cowboys","indices":["7","15"]} |
>> | {"text":"Ebay","indices":["52","57"]} |
>> | {"text":"NFL","indices":["58","62"]} |
>> | {"text":"Christmas","indices":["63","73"]} |
>> | {"text":"Gift","indices":["74","79"]} |
>> +------------+
>> 
>> 
>> —Andries
>> 
>> 
>> 
>> 
>>> On Feb 5, 2015, at 6:41 PM, Christopher Matta <cm...@mapr.com> wrote:
>>> 
>>> Andreas, I think I may have solved your issue:
>>> 
>>>> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE
>> t.entities.hashtags[0].text is not null limit 10;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> |
>> [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
>>> |
>>> |
>> [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
>>> |
>>> |
>> [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
>>> |
>>> | [{"text":"NotOneDime","indices":[14,25]}] |
>>> | [{"text":"BlackFriday","indices":[43,55]}] |
>>> | [{"text":"JRStudio","indices":[93,102]}] |
>>> |
>> [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
>>> |
>>> | [{"text":"BlackFriday","indices":[0,12]}] |
>>> |
>> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>>> |
>>> |
>> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>>> |
>>> +------------+
>>> 10 rows selected (1.697 seconds)
>>> 
>>> I noticed when only extracting the first text item in the hashtag array (
>>> t.entities.hashtags[0].text) it was returning null for those tweets that
>>> didn’t have hashtags, so filtering out that seems to work.
>>> ​
>>> 
>>> Chris Matta
>>> cmatta@mapr.com
>>> 215-701-3146
>>> 
>>> On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <
>> altekrusejason@gmail.com>
>>> wrote:
>>> 
>>>> As Aditya commented before this will work if the lists only contain
>> scalars
>>>> 
>>>> repeated_count('entities.urls') > 0
>>>> 
>>>> If the lists contain maps unfortunately this is not available today.
>> There
>>>> is an enhancement request open for this feature. I have marked it for a
>> fix
>>>> in 0.9 as it is more of a feature request than a bug and we are working
>> on
>>>> closing a large number of bugs for 0.8 before we get to issues like
>> this.
>>>> 
>>>> https://issues.apache.org/jira/browse/DRILL-1650
>>>> 
>>>> -Jason
>>>> 
>>>> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> Unfortunately it seems that with larger data sets the use of flatten
>>>> seems
>>>>> to produce an error.
>>>>> 
>>>>> Any other options to filter out JSON records (objects) where an array
>> is
>>>>> empty?
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
>>>>> aengelbrecht@maprtech.com> wrote:
>>>>> 
>>>>>> It was interesting to see flatten bypass the records with an empty
>>>> array.
>>>>>> 
>>>>>> Unfortunately some arrays are much more complex than the example here,
>>>>> and it is still useful to have the ability to filter out ones that are
>>>>> empty.
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>> 
>>>>>>> I figured out the differences after getting the json file from
>>>> Andries.
>>>>>>> 
>>>>>>> My json file is like:
>>>>>>> {“entities”:{xxx},“entities”:{yyy}... }
>>>>>>> 
>>>>>>> Andries' json file is like:
>>>>>>> {“entities”:{xxx}}
>>>>>>> {“entities”:{yyy}}
>>>>>>> ...
>>>>>>> 
>>>>>>> So basically Andries' json file is contains multiple json files.
>>>>>>> 
>>>>>>> Fortunately, we can use the same SQL to get the same results:
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | []         |
>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>> | [{"text":"aaa"}] |
>>>>>>> | []         |
>>>>>>> | [{"text":"bbb"}] |
>>>>>>> +------------+
>>>>>>> 5 rows selected (0.136 seconds)
>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>> . . . . . . . > (
>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>> dfs.tmp.`z2.json` t
>>>>>>> . . . . . . . > )
>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | GoPatriots |
>>>>>>> | aaa        |
>>>>>>> | bbb        |
>>>>>>> +------------+
>>>>>>> 3 rows selected (0.115 seconds)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> In my case it returns the empty records when flatten is not used.
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
>>>> hashtags
>>>>>>>> from `twitter.json` t limit 10;
>>>>>>>> +------------+
>>>>>>>> |  hashtags  |
>>>>>>>> +------------+
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>> +------------+
>>>>>>>> 
>>>>>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>> 
>>>>>>>>> Actually not due to flatten, if you directly query the file, it
>> will
>>>>> only
>>>>>>>>> show the non-null values.
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.123 seconds)
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hao
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
>>>> if
>>>>>>>> that
>>>>>>>>>> is an ideal result for all use cases, but certainly usable in this
>>>>> case.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I can also fetch non-null values for attached json file which
>>>>> contains
>>>>>>>> 6
>>>>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>>>>>>> 
>>>>>>>>>>> Could you share your complete "twitter.json"?
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>>>>>> . . . . . . . > (
>>>>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>>>>> dfs.tmp.`z.json` t
>>>>>>>>>>> . . . . . . . > )
>>>>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | GoPatriots |
>>>>>>>>>>> | aaa        |
>>>>>>>>>>> | bbb        |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 3 rows selected (0.122 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hao
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>> The sample data I posted only has 1 element, but some records
>> have
>>>>>>>>>> multiple elements in them.
>>>>>>>>>>> 
>>>>>>>>>>> Interestingly enough though
>>>>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
>>>> 10;
>>>>>>>>>>> 
>>>>>>>>>>> Produces
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | null       |
>>>>>>>>>>> | SportsNews |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | SportsNews |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | CARvsSEA   |
>>>>>>>>>>> +——————+
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> And
>>>>>>>>>>> 
>>>>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>>>>>>> 
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>>>>>>> +——————+
>>>>>>>>>>> 
>>>>>>>>>>> Strange part is that there is no indices map in the hashtags
>>>> array,
>>>>> so
>>>>>>>>>> no idea why it shows up when pointing to the first lament in an
>>>> empty
>>>>>>>> array.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I just noticed that the result of "hashtags" is just an array
>>>> with
>>>>>>>>>> only 1
>>>>>>>>>>>> element.
>>>>>>>>>>>> So take your example:
>>>>>>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>>>>>>> {
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> },
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>> {
>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>  "indices": []
>>>>>>>>>>>> }
>>>>>>>>>>>> ],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Now we can do this to achieve the results:
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
>>>> t
>>>>> ;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.09 seconds)
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | GoPatriots |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.108 seconds)
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hao
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> When I run the query on a larger dataset it actually show the
>>>>> empty
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However having the output as maps is not very useful, unless i
>>>> can
>>>>>>>>>> filter
>>>>>>>>>>>>> out the records with empty arrays and then drill deeper into
>> the
>>>>> ones
>>>>>>>>>> with
>>>>>>>>>>>>> data in the arrays.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
>>>> one
>>>>>>>>>> with an
>>>>>>>>>>>>> empty array as above and the other with the array data.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com>
>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
>>>>> ignoring
>>>>>>>>>> the
>>>>>>>>>>>>>> null value.
>>>>>>>>>>>>>> However then if we try to get "hashtags.text", it
>> fails...which
>>>>>>>>>> means it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> still trying to read the NULL value.
>>>>>>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
>>>> dfs.tmp.`d.json`
>>>>> t ;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>>> be
>>>>>>>>>> cast to
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>> executing
>>>>>>>>>> query.
>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>>>>>>> `/twitter.json` t
>>>>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>>> be
>>>>>>>>>> cast to
>>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>>> executing
>>>>>>>>>> query.
>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The issue seems to be that if some records have arrays with
>>>>> maps in
>>>>>>>>>> them
>>>>>>>>>>>>>>> and others are empty.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>>> as t
>>>>>>>>>> where
>>>>>>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>>> as t
>>>>>>>>>> where
>>>>>>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
>>>>> adityakishore@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>>>>>>> primitive
>>>>>>>>>>>>>>>>> types. In your example, it appears from the error, the
>> array
>>>>>>>> field
>>>>>>>>>>>>>>> 'member'
>>>>>>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>>>>>>> cmatta@mapr.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
>> from
>>>>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>>>>>>> materialize
>>>>>>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
>>>>> expression:
>>>>>>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>>>> executing
>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ​
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>>>>>>> adityakishore@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
>>>> drill?
>>>>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
>>>> array
>>>>>>>>>> with
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
>>>>> drill
>>>>>>>>>> fails
>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
>>>>> with/*
>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
>>>> url
>>>>> is
>>>>>>>>>> not
>>>>>>>>>>>>>>>>> null.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter
>> data
>>>>> as
>>>>>>>> an
>>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>>> below. Some records have an empty array with
>>>> “hashtags”:[]
>>>>>>>> and
>>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> <z.json>
>> 
>> 

Re: Filter out empty arrays in JSON

Posted by Christopher Matta <cm...@mapr.com>.
You missed something:

select t.entities.hashtags from dfs.twitter./nflt where
t.entities.hashtags[0].*text* is not null limit 10;

Add the .text to the end of the where.
​

Chris Matta
cmatta@mapr.com
215-701-3146

On Thu, Feb 5, 2015 at 10:05 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Chris
>
> Thanks, but I tried it before without success.
>
> I still get this on Drill 0.7.
>
> select t.entities.hashtags from dfs.twitter.`/nfl` t where
> t.entities.hashtags[0] is not null limit 10;
>
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> Query failed: Query failed: Failure while running fragment., Failure while
> trying to materialize incoming schema.  Errors:
>
> Error in expression at index 0.  Error: Missing function implementation:
> [isnotnull(MAP-REQUIRED)].  Full expression: null.. [
> 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
> [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
>
>
> Mine seems finicky about even using the first element in the array
>
> select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;
>
> Query failed: Query failed: Failure while running fragment., index: 4,
> length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on
> drilldemo:31010 ]
> [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
>
>
> However flatten seems to work. (as long as you don’t have thousands of
> small json files)
>
> select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;
>
> +------------+
> |   EXPR$0   |
> +------------+
> | {"text":"PantherPride","indices":["12","25"]} |
> | {"text":"KeepPounding","indices":["48","61"]} |
> | {"text":"vegas","indices":["107","113"]} |
> | {"text":"nfl","indices":["23","27"]} |
> | {"text":"Cowboys","indices":["7","15"]} |
> | {"text":"Cowboys","indices":["7","15"]} |
> | {"text":"Ebay","indices":["52","57"]} |
> | {"text":"NFL","indices":["58","62"]} |
> | {"text":"Christmas","indices":["63","73"]} |
> | {"text":"Gift","indices":["74","79"]} |
> +------------+
>
>
> —Andries
>
>
>
>
> On Feb 5, 2015, at 6:41 PM, Christopher Matta <cm...@mapr.com> wrote:
>
> > Andreas, I think I may have solved your issue:
> >
> >> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE
> t.entities.hashtags[0].text is not null limit 10;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > |
> [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
> > |
> > |
> [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
> > |
> > |
> [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
> > |
> > | [{"text":"NotOneDime","indices":[14,25]}] |
> > | [{"text":"BlackFriday","indices":[43,55]}] |
> > | [{"text":"JRStudio","indices":[93,102]}] |
> > |
> [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
> > |
> > | [{"text":"BlackFriday","indices":[0,12]}] |
> > |
> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> > |
> > |
> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> > |
> > +------------+
> > 10 rows selected (1.697 seconds)
> >
> > I noticed when only extracting the first text item in the hashtag array (
> > t.entities.hashtags[0].text) it was returning null for those tweets that
> > didn’t have hashtags, so filtering out that seems to work.
> > ​
> >
> > Chris Matta
> > cmatta@mapr.com
> > 215-701-3146
> >
> > On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <
> altekrusejason@gmail.com>
> > wrote:
> >
> >> As Aditya commented before this will work if the lists only contain
> scalars
> >>
> >> repeated_count('entities.urls') > 0
> >>
> >> If the lists contain maps unfortunately this is not available today.
> There
> >> is an enhancement request open for this feature. I have marked it for a
> fix
> >> in 0.9 as it is more of a feature request than a bug and we are working
> on
> >> closing a large number of bugs for 0.8 before we get to issues like
> this.
> >>
> >> https://issues.apache.org/jira/browse/DRILL-1650
> >>
> >> -Jason
> >>
> >> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
> >> aengelbrecht@maprtech.com> wrote:
> >>
> >>> Unfortunately it seems that with larger data sets the use of flatten
> >> seems
> >>> to produce an error.
> >>>
> >>> Any other options to filter out JSON records (objects) where an array
> is
> >>> empty?
> >>>
> >>> Thanks
> >>> —Andries
> >>>
> >>>
> >>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
> >>> aengelbrecht@maprtech.com> wrote:
> >>>
> >>>> It was interesting to see flatten bypass the records with an empty
> >> array.
> >>>>
> >>>> Unfortunately some arrays are much more complex than the example here,
> >>> and it is still useful to have the ability to filter out ones that are
> >>> empty.
> >>>>
> >>>> —Andries
> >>>>
> >>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>
> >>>>> I figured out the differences after getting the json file from
> >> Andries.
> >>>>>
> >>>>> My json file is like:
> >>>>> {“entities”:{xxx},“entities”:{yyy}... }
> >>>>>
> >>>>> Andries' json file is like:
> >>>>> {“entities”:{xxx}}
> >>>>> {“entities”:{yyy}}
> >>>>> ...
> >>>>>
> >>>>> So basically Andries' json file is contains multiple json files.
> >>>>>
> >>>>> Fortunately, we can use the same SQL to get the same results:
> >>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | []         |
> >>>>> | [{"text":"GoPatriots"}] |
> >>>>> | [{"text":"aaa"}] |
> >>>>> | []         |
> >>>>> | [{"text":"bbb"}] |
> >>>>> +------------+
> >>>>> 5 rows selected (0.136 seconds)
> >>>>> 0: jdbc:drill:> with tmp as
> >>>>> . . . . . . . > (
> >>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>> dfs.tmp.`z2.json` t
> >>>>> . . . . . . . > )
> >>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | GoPatriots |
> >>>>> | aaa        |
> >>>>> | bbb        |
> >>>>> +------------+
> >>>>> 3 rows selected (0.115 seconds)
> >>>>>
> >>>>> Thanks,
> >>>>> Hao
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> >>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>
> >>>>>> In my case it returns the empty records when flatten is not used.
> >>>>>>
> >>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
> >> hashtags
> >>>>>> from `twitter.json` t limit 10;
> >>>>>> +------------+
> >>>>>> |  hashtags  |
> >>>>>> +------------+
> >>>>>> | []         |
> >>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>> | []         |
> >>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>> | []         |
> >>>>>> | []         |
> >>>>>> | []         |
> >>>>>> | []         |
> >>>>>> | []         |
> >>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>> +------------+
> >>>>>>
> >>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>
> >>>>>>> Actually not due to flatten, if you directly query the file, it
> will
> >>> only
> >>>>>>> show the non-null values.
> >>>>>>>
> >>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> >>>>>>> +------------+
> >>>>>>> |   EXPR$0   |
> >>>>>>> +------------+
> >>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> >>>>>>> +------------+
> >>>>>>> 1 row selected (0.123 seconds)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Hao
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> >>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
> >> if
> >>>>>> that
> >>>>>>>> is an ideal result for all use cases, but certainly usable in this
> >>> case.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> —Andries
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> I can also fetch non-null values for attached json file which
> >>> contains
> >>>>>> 6
> >>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
> >>>>>>>>>
> >>>>>>>>> Could you share your complete "twitter.json"?
> >>>>>>>>>
> >>>>>>>>> 0: jdbc:drill:> with tmp as
> >>>>>>>>> . . . . . . . > (
> >>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>>>>> dfs.tmp.`z.json` t
> >>>>>>>>> . . . . . . . > )
> >>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>>>>>> +------------+
> >>>>>>>>> |   EXPR$0   |
> >>>>>>>>> +------------+
> >>>>>>>>> | GoPatriots |
> >>>>>>>>> | aaa        |
> >>>>>>>>> | bbb        |
> >>>>>>>>> +------------+
> >>>>>>>>> 3 rows selected (0.122 seconds)
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Hao
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>> The sample data I posted only has 1 element, but some records
> have
> >>>>>>>> multiple elements in them.
> >>>>>>>>>
> >>>>>>>>> Interestingly enough though
> >>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
> >> 10;
> >>>>>>>>>
> >>>>>>>>> Produces
> >>>>>>>>> +------------+
> >>>>>>>>> |   EXPR$0   |
> >>>>>>>>> +------------+
> >>>>>>>>> | null       |
> >>>>>>>>> | SportsNews |
> >>>>>>>>> | null       |
> >>>>>>>>> | SportsNews |
> >>>>>>>>> | null       |
> >>>>>>>>> | null       |
> >>>>>>>>> | null       |
> >>>>>>>>> | null       |
> >>>>>>>>> | null       |
> >>>>>>>>> | CARvsSEA   |
> >>>>>>>>> +——————+
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> And
> >>>>>>>>>
> >>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>>>>>>>
> >>>>>>>>> +------------+
> >>>>>>>>> |   EXPR$0   |
> >>>>>>>>> +------------+
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"indices":[]} |
> >>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>>>>>>>> +——————+
> >>>>>>>>>
> >>>>>>>>> Strange part is that there is no indices map in the hashtags
> >> array,
> >>> so
> >>>>>>>> no idea why it shows up when pointing to the first lament in an
> >> empty
> >>>>>> array.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> —Andries
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> I just noticed that the result of "hashtags" is just an array
> >> with
> >>>>>>>> only 1
> >>>>>>>>>> element.
> >>>>>>>>>> So take your example:
> >>>>>>>>>> [root@maprdemo tmp]# cat d.json
> >>>>>>>>>> {
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [],
> >>>>>>>>>> "symbols": [],
> >>>>>>>>>> "urls": [],
> >>>>>>>>>> "hashtags": [],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> },
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>> "hashtags": [
> >>>>>>>>>> {
> >>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>   "indices": []
> >>>>>>>>>> }
> >>>>>>>>>> ],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Now we can do this to achieve the results:
> >>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> >> t
> >>> ;
> >>>>>>>>>> +------------+
> >>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>> +------------+
> >>>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>>> +------------+
> >>>>>>>>>> 1 row selected (0.09 seconds)
> >>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >>>>>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>> +------------+
> >>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>> +------------+
> >>>>>>>>>> | GoPatriots |
> >>>>>>>>>> +------------+
> >>>>>>>>>> 1 row selected (0.108 seconds)
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> When I run the query on a larger dataset it actually show the
> >>> empty
> >>>>>>>>>>> records.
> >>>>>>>>>>>
> >>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>>>>>>>
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | []         |
> >>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> 10 rows selected (2.899 seconds)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> However having the output as maps is not very useful, unless i
> >> can
> >>>>>>>> filter
> >>>>>>>>>>> out the records with empty arrays and then drill deeper into
> the
> >>> ones
> >>>>>>>> with
> >>>>>>>>>>> data in the arrays.
> >>>>>>>>>>>
> >>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
> >> one
> >>>>>>>> with an
> >>>>>>>>>>> empty array as above and the other with the array data.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> —Andries
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I am not sure if below is expected behavior.
> >>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
> >>> ignoring
> >>>>>>>> the
> >>>>>>>>>>>> null value.
> >>>>>>>>>>>> However then if we try to get "hashtags.text", it
> fails...which
> >>>>>>>> means it
> >>>>>>>>>>> is
> >>>>>>>>>>>> still trying to read the NULL value.
> >>>>>>>>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> >> dfs.tmp.`d.json`
> >>> t ;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.109 seconds)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >>>>>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>>>>
> >>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> >> be
> >>>>>>>> cast to
> >>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Error: exception while executing query: Failure while
> executing
> >>>>>>>> query.
> >>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Hao
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Now try on hashtags with the following:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >>>>>>>> `/twitter.json` t
> >>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> >> be
> >>>>>>>> cast to
> >>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error: exception while executing query: Failure while
> >> executing
> >>>>>>>> query.
> >>>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>> "hashtags": [],
> >>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>> },
> >>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>>  "text": "GoPatriots",
> >>>>>>>>>>>>>  "indices": []
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> ],
> >>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The issue seems to be that if some records have arrays with
> >>> maps in
> >>>>>>>> them
> >>>>>>>>>>>>> and others are empty.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> —Andries
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> >> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Seems it works for below json file:
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>  "text": "GoPatriots",
> >>>>>>>>>>>>>>  "indices": [
> >>>>>>>>>>>>>>    83,
> >>>>>>>>>>>>>>    94
> >>>>>>>>>>>>>>  ]
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>> },
> >>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>  "text": "GoPatriots",
> >>>>>>>>>>>>>>  "indices": []
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> >>> as t
> >>>>>>>> where
> >>>>>>>>>>>>>> t.entities.urls is not null;
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> | [7,8,9]    |
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> >>> as t
> >>>>>>>> where
> >>>>>>>>>>>>>> t.entities.urls is null;
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> >>> adityakishore@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I believe that this works if the array contains homogeneous
> >>>>>>>> primitive
> >>>>>>>>>>>>>>> types. In your example, it appears from the error, the
> array
> >>>>>> field
> >>>>>>>>>>>>> 'member'
> >>>>>>>>>>>>>>> contained maps for at least one record.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >>>>>>>> cmatta@mapr.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
> from
> >>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
> >>>>>> materialize
> >>>>>>>>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> >>> expression:
> >>>>>>>>>>>>> --UNKNOWN
> >>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>>>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
> >>> executing
> >>>>>>>>>>> query.
> >>>>>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ​
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Chris Matta
> >>>>>>>>>>>>>>>> cmatta@mapr.com
> >>>>>>>>>>>>>>>> 215-701-3146
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> >>>>>> adityakishore@gmail.com
> >>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
> >> drill?
> >>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> >> array
> >>>>>>>> with
> >>>>>>>>>>> data
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
> >>> drill
> >>>>>>>> fails
> >>>>>>>>>>>>> due
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
> >>> with/*
> >>>>>>>> where
> >>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
> >> url
> >>> is
> >>>>>>>> not
> >>>>>>>>>>>>>>> null.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter
> data
> >>> as
> >>>>>> an
> >>>>>>>>>>>>>>> example
> >>>>>>>>>>>>>>>>>> below. Some records have an empty array with
> >> “hashtags”:[]
> >>>>>> and
> >>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>>>  "text": "GoPatriots",
> >>>>>>>>>>>>>>>>>>  "indices": [
> >>>>>>>>>>>>>>>>>>    83,
> >>>>>>>>>>>>>>>>>>    94
> >>>>>>>>>>>>>>>>>>  ]
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>> —Andries
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> <z.json>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
>
>

Re: Filter out empty arrays in JSON

Posted by Christopher Matta <cm...@mapr.com>.
You can’t use repeated_count function if the list contains maps, only
scalars. This was pointed out by Jason (and Aditya) earlier in this thread:

As Aditya commented before this will work if the lists only contain scalars
repeated_count(‘entities.urls’) > 0

If the lists contain maps unfortunately this is not available today. There
is an enhancement request open for this feature. I have marked it for a fix
in 0.9 as it is more of a feature request than a bug and we are working on
closing a large number of bugs for 0.8 before we get to issues like this.

https://issues.apache.org/jira/browse/DRILL-1650

​

Chris Matta
cmatta@mapr.com
215-701-3146

On Thu, Feb 5, 2015 at 10:36 PM, Sudhakar Thota <st...@maprtech.com> wrote:

> You can use repeated_count function to eliminate the null arrays.
>
> Sudhakar Thota
> Sent from my iPhone
>
> > On Feb 5, 2015, at 7:05 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> >
> > Chris
> >
> > Thanks, but I tried it before without success.
> >
> > I still get this on Drill 0.7.
> >
> > select t.entities.hashtags from dfs.twitter.`/nfl` t where
> t.entities.hashtags[0] is not null limit 10;
> >
> > Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> > Query failed: Query failed: Failure while running fragment., Failure
> while trying to materialize incoming schema.  Errors:
> >
> > Error in expression at index 0.  Error: Missing function implementation:
> [isnotnull(MAP-REQUIRED)].  Full expression: null.. [
> 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
> > [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
> >
> >
> > Mine seems finicky about even using the first element in the array
> >
> > select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;
> >
> > Query failed: Query failed: Failure while running fragment., index: 4,
> length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on
> drilldemo:31010 ]
> > [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
> >
> >
> > However flatten seems to work. (as long as you don’t have thousands of
> small json files)
> >
> > select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;
> >
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | {"text":"PantherPride","indices":["12","25"]} |
> > | {"text":"KeepPounding","indices":["48","61"]} |
> > | {"text":"vegas","indices":["107","113"]} |
> > | {"text":"nfl","indices":["23","27"]} |
> > | {"text":"Cowboys","indices":["7","15"]} |
> > | {"text":"Cowboys","indices":["7","15"]} |
> > | {"text":"Ebay","indices":["52","57"]} |
> > | {"text":"NFL","indices":["58","62"]} |
> > | {"text":"Christmas","indices":["63","73"]} |
> > | {"text":"Gift","indices":["74","79"]} |
> > +------------+
> >
> >
> > —Andries
> >
> >
> >
> >
> >> On Feb 5, 2015, at 6:41 PM, Christopher Matta <cm...@mapr.com> wrote:
> >>
> >> Andreas, I think I may have solved your issue:
> >>
> >>> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t
> WHERE t.entities.hashtags[0].text is not null limit 10;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> |
> [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
> >> |
> >> |
> [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
> >> |
> >> |
> [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
> >> |
> >> | [{"text":"NotOneDime","indices":[14,25]}] |
> >> | [{"text":"BlackFriday","indices":[43,55]}] |
> >> | [{"text":"JRStudio","indices":[93,102]}] |
> >> |
> [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
> >> |
> >> | [{"text":"BlackFriday","indices":[0,12]}] |
> >> |
> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> >> |
> >> |
> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> >> |
> >> +------------+
> >> 10 rows selected (1.697 seconds)
> >>
> >> I noticed when only extracting the first text item in the hashtag array
> (
> >> t.entities.hashtags[0].text) it was returning null for those tweets that
> >> didn’t have hashtags, so filtering out that seems to work.
> >> ​
> >>
> >> Chris Matta
> >> cmatta@mapr.com
> >> 215-701-3146
> >>
> >> On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <
> altekrusejason@gmail.com>
> >> wrote:
> >>
> >>> As Aditya commented before this will work if the lists only contain
> scalars
> >>>
> >>> repeated_count('entities.urls') > 0
> >>>
> >>> If the lists contain maps unfortunately this is not available today.
> There
> >>> is an enhancement request open for this feature. I have marked it for
> a fix
> >>> in 0.9 as it is more of a feature request than a bug and we are
> working on
> >>> closing a large number of bugs for 0.8 before we get to issues like
> this.
> >>>
> >>> https://issues.apache.org/jira/browse/DRILL-1650
> >>>
> >>> -Jason
> >>>
> >>> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
> >>> aengelbrecht@maprtech.com> wrote:
> >>>
> >>>> Unfortunately it seems that with larger data sets the use of flatten
> >>> seems
> >>>> to produce an error.
> >>>>
> >>>> Any other options to filter out JSON records (objects) where an array
> is
> >>>> empty?
> >>>>
> >>>> Thanks
> >>>> —Andries
> >>>>
> >>>>
> >>>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
> >>>> aengelbrecht@maprtech.com> wrote:
> >>>>
> >>>>> It was interesting to see flatten bypass the records with an empty
> >>> array.
> >>>>>
> >>>>> Unfortunately some arrays are much more complex than the example
> here,
> >>>> and it is still useful to have the ability to filter out ones that are
> >>>> empty.
> >>>>>
> >>>>> —Andries
> >>>>>
> >>>>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>
> >>>>>> I figured out the differences after getting the json file from
> >>> Andries.
> >>>>>>
> >>>>>> My json file is like:
> >>>>>> {“entities”:{xxx},“entities”:{yyy}... }
> >>>>>>
> >>>>>> Andries' json file is like:
> >>>>>> {“entities”:{xxx}}
> >>>>>> {“entities”:{yyy}}
> >>>>>> ...
> >>>>>>
> >>>>>> So basically Andries' json file is contains multiple json files.
> >>>>>>
> >>>>>> Fortunately, we can use the same SQL to get the same results:
> >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | []         |
> >>>>>> | [{"text":"GoPatriots"}] |
> >>>>>> | [{"text":"aaa"}] |
> >>>>>> | []         |
> >>>>>> | [{"text":"bbb"}] |
> >>>>>> +------------+
> >>>>>> 5 rows selected (0.136 seconds)
> >>>>>> 0: jdbc:drill:> with tmp as
> >>>>>> . . . . . . . > (
> >>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>>> dfs.tmp.`z2.json` t
> >>>>>> . . . . . . . > )
> >>>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | GoPatriots |
> >>>>>> | aaa        |
> >>>>>> | bbb        |
> >>>>>> +------------+
> >>>>>> 3 rows selected (0.115 seconds)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> >>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>
> >>>>>>> In my case it returns the empty records when flatten is not used.
> >>>>>>>
> >>>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
> >>> hashtags
> >>>>>>> from `twitter.json` t limit 10;
> >>>>>>> +------------+
> >>>>>>> |  hashtags  |
> >>>>>>> +------------+
> >>>>>>> | []         |
> >>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>> | []         |
> >>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>> | []         |
> >>>>>>> | []         |
> >>>>>>> | []         |
> >>>>>>> | []         |
> >>>>>>> | []         |
> >>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>>> +------------+
> >>>>>>>
> >>>>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>> Actually not due to flatten, if you directly query the file, it
> will
> >>>> only
> >>>>>>>> show the non-null values.
> >>>>>>>>
> >>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json`
> t;
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> >>>>>>>> +------------+
> >>>>>>>> 1 row selected (0.123 seconds)
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Hao
> >>>>>>>>
> >>>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> >>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
> >>> if
> >>>>>>> that
> >>>>>>>>> is an ideal result for all use cases, but certainly usable in
> this
> >>>> case.
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> —Andries
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I can also fetch non-null values for attached json file which
> >>>> contains
> >>>>>>> 6
> >>>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
> >>>>>>>>>>
> >>>>>>>>>> Could you share your complete "twitter.json"?
> >>>>>>>>>>
> >>>>>>>>>> 0: jdbc:drill:> with tmp as
> >>>>>>>>>> . . . . . . . > (
> >>>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>>>>>> dfs.tmp.`z.json` t
> >>>>>>>>>> . . . . . . . > )
> >>>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>>>>>>> +------------+
> >>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>> +------------+
> >>>>>>>>>> | GoPatriots |
> >>>>>>>>>> | aaa        |
> >>>>>>>>>> | bbb        |
> >>>>>>>>>> +------------+
> >>>>>>>>>> 3 rows selected (0.122 seconds)
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>> The sample data I posted only has 1 element, but some records
> have
> >>>>>>>>> multiple elements in them.
> >>>>>>>>>>
> >>>>>>>>>> Interestingly enough though
> >>>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
> >>> 10;
> >>>>>>>>>>
> >>>>>>>>>> Produces
> >>>>>>>>>> +------------+
> >>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>> +------------+
> >>>>>>>>>> | null       |
> >>>>>>>>>> | SportsNews |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | SportsNews |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | null       |
> >>>>>>>>>> | CARvsSEA   |
> >>>>>>>>>> +——————+
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> And
> >>>>>>>>>>
> >>>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>>>>>>>>
> >>>>>>>>>> +------------+
> >>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>> +------------+
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"indices":[]} |
> >>>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>>>>>>>>> +——————+
> >>>>>>>>>>
> >>>>>>>>>> Strange part is that there is no indices map in the hashtags
> >>> array,
> >>>> so
> >>>>>>>>> no idea why it shows up when pointing to the first lament in an
> >>> empty
> >>>>>>> array.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> —Andries
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I just noticed that the result of "hashtags" is just an array
> >>> with
> >>>>>>>>> only 1
> >>>>>>>>>>> element.
> >>>>>>>>>>> So take your example:
> >>>>>>>>>>> [root@maprdemo tmp]# cat d.json
> >>>>>>>>>>> {
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [],
> >>>>>>>>>>> "symbols": [],
> >>>>>>>>>>> "urls": [],
> >>>>>>>>>>> "hashtags": [],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> },
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>> "hashtags": [
> >>>>>>>>>>> {
> >>>>>>>>>>>  "text": "GoPatriots",
> >>>>>>>>>>>  "indices": []
> >>>>>>>>>>> }
> >>>>>>>>>>> ],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> Now we can do this to achieve the results:
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> dfs.tmp.`d.json`
> >>> t
> >>>> ;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> 1 row selected (0.09 seconds)
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >>>>>>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> | GoPatriots |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> 1 row selected (0.108 seconds)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Hao
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> When I run the query on a larger dataset it actually show the
> >>>> empty
> >>>>>>>>>>>> records.
> >>>>>>>>>>>>
> >>>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>>>>>>>>
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | []         |
> >>>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 10 rows selected (2.899 seconds)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> However having the output as maps is not very useful, unless i
> >>> can
> >>>>>>>>> filter
> >>>>>>>>>>>> out the records with empty arrays and then drill deeper into
> the
> >>>> ones
> >>>>>>>>> with
> >>>>>>>>>>>> data in the arrays.
> >>>>>>>>>>>>
> >>>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
> >>> one
> >>>>>>>>> with an
> >>>>>>>>>>>> empty array as above and the other with the array data.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> —Andries
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am not sure if below is expected behavior.
> >>>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
> >>>> ignoring
> >>>>>>>>> the
> >>>>>>>>>>>>> null value.
> >>>>>>>>>>>>> However then if we try to get "hashtags.text", it
> fails...which
> >>>>>>>>> means it
> >>>>>>>>>>>> is
> >>>>>>>>>>>>> still trying to read the NULL value.
> >>>>>>>>>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> >>> dfs.tmp.`d.json`
> >>>> t ;
> >>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>> 1 row selected (0.109 seconds)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >>>>>>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> >>> be
> >>>>>>>>> cast to
> >>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error: exception while executing query: Failure while
> executing
> >>>>>>>>> query.
> >>>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Now try on hashtags with the following:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >>>>>>>>> `/twitter.json` t
> >>>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector
> cannot
> >>> be
> >>>>>>>>> cast to
> >>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Error: exception while executing query: Failure while
> >>> executing
> >>>>>>>>> query.
> >>>>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>> "hashtags": [],
> >>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>> },
> >>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> "text": "GoPatriots",
> >>>>>>>>>>>>>> "indices": []
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The issue seems to be that if some records have arrays with
> >>>> maps in
> >>>>>>>>> them
> >>>>>>>>>>>>>> and others are empty.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> —Andries
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> >>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Seems it works for below json file:
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> "text": "GoPatriots",
> >>>>>>>>>>>>>>> "indices": [
> >>>>>>>>>>>>>>>   83,
> >>>>>>>>>>>>>>>   94
> >>>>>>>>>>>>>>> ]
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> "text": "GoPatriots",
> >>>>>>>>>>>>>>> "indices": []
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from
> dfs.tmp.`a.json`
> >>>> as t
> >>>>>>>>> where
> >>>>>>>>>>>>>>> t.entities.urls is not null;
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> | [7,8,9]    |
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from
> dfs.tmp.`a.json`
> >>>> as t
> >>>>>>>>> where
> >>>>>>>>>>>>>>> t.entities.urls is null;
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> +------------+
> >>>>>>>>>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> >>>> adityakishore@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I believe that this works if the array contains
> homogeneous
> >>>>>>>>> primitive
> >>>>>>>>>>>>>>>> types. In your example, it appears from the error, the
> array
> >>>>>>> field
> >>>>>>>>>>>>>> 'member'
> >>>>>>>>>>>>>>>> contained maps for at least one record.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >>>>>>>>> cmatta@mapr.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
> from
> >>>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
> >>>>>>> materialize
> >>>>>>>>>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> >>>> expression:
> >>>>>>>>>>>>>> --UNKNOWN
> >>>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>>>>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
> >>>> executing
> >>>>>>>>>>>> query.
> >>>>>>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ​
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Chris Matta
> >>>>>>>>>>>>>>>>> cmatta@mapr.com
> >>>>>>>>>>>>>>>>> 215-701-3146
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> >>>>>>> adityakishore@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
> >>> drill?
> >>>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> >>> array
> >>>>>>>>> with
> >>>>>>>>>>>> data
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
> >>>> drill
> >>>>>>>>> fails
> >>>>>>>>>>>>>> due
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
> >>>> with/*
> >>>>>>>>> where
> >>>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
> >>> url
> >>>> is
> >>>>>>>>> not
> >>>>>>>>>>>>>>>> null.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter
> data
> >>>> as
> >>>>>>> an
> >>>>>>>>>>>>>>>> example
> >>>>>>>>>>>>>>>>>>> below. Some records have an empty array with
> >>> “hashtags”:[]
> >>>>>>> and
> >>>>>>>>>>>>>> others
> >>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>>>> "text": "GoPatriots",
> >>>>>>>>>>>>>>>>>>> "indices": [
> >>>>>>>>>>>>>>>>>>>   83,
> >>>>>>>>>>>>>>>>>>>   94
> >>>>>>>>>>>>>>>>>>> ]
> >>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>> —Andries
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> <z.json>
> >
>

Re: Filter out empty arrays in JSON

Posted by Sudhakar Thota <st...@maprtech.com>.
You can use repeated_count function to eliminate the null arrays.

Sudhakar Thota
Sent from my iPhone

> On Feb 5, 2015, at 7:05 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:
> 
> Chris 
> 
> Thanks, but I tried it before without success.
> 
> I still get this on Drill 0.7. 
> 
> select t.entities.hashtags from dfs.twitter.`/nfl` t where t.entities.hashtags[0] is not null limit 10;
> 
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> Query failed: Query failed: Failure while running fragment., Failure while trying to materialize incoming schema.  Errors:
> 
> Error in expression at index 0.  Error: Missing function implementation: [isnotnull(MAP-REQUIRED)].  Full expression: null.. [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
> [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
> 
> 
> Mine seems finicky about even using the first element in the array
> 
> select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;
> 
> Query failed: Query failed: Failure while running fragment., index: 4, length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
> [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
> 
> 
> However flatten seems to work. (as long as you don’t have thousands of small json files)
> 
> select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;
> 
> +------------+
> |   EXPR$0   |
> +------------+
> | {"text":"PantherPride","indices":["12","25"]} |
> | {"text":"KeepPounding","indices":["48","61"]} |
> | {"text":"vegas","indices":["107","113"]} |
> | {"text":"nfl","indices":["23","27"]} |
> | {"text":"Cowboys","indices":["7","15"]} |
> | {"text":"Cowboys","indices":["7","15"]} |
> | {"text":"Ebay","indices":["52","57"]} |
> | {"text":"NFL","indices":["58","62"]} |
> | {"text":"Christmas","indices":["63","73"]} |
> | {"text":"Gift","indices":["74","79"]} |
> +------------+
> 
> 
> —Andries
> 
> 
> 
> 
>> On Feb 5, 2015, at 6:41 PM, Christopher Matta <cm...@mapr.com> wrote:
>> 
>> Andreas, I think I may have solved your issue:
>> 
>>> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE t.entities.hashtags[0].text is not null limit 10;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
>> |
>> | [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
>> |
>> | [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
>> |
>> | [{"text":"NotOneDime","indices":[14,25]}] |
>> | [{"text":"BlackFriday","indices":[43,55]}] |
>> | [{"text":"JRStudio","indices":[93,102]}] |
>> | [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
>> |
>> | [{"text":"BlackFriday","indices":[0,12]}] |
>> | [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>> |
>> | [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>> |
>> +------------+
>> 10 rows selected (1.697 seconds)
>> 
>> I noticed when only extracting the first text item in the hashtag array (
>> t.entities.hashtags[0].text) it was returning null for those tweets that
>> didn’t have hashtags, so filtering out that seems to work.
>> ​
>> 
>> Chris Matta
>> cmatta@mapr.com
>> 215-701-3146
>> 
>> On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <al...@gmail.com>
>> wrote:
>> 
>>> As Aditya commented before this will work if the lists only contain scalars
>>> 
>>> repeated_count('entities.urls') > 0
>>> 
>>> If the lists contain maps unfortunately this is not available today. There
>>> is an enhancement request open for this feature. I have marked it for a fix
>>> in 0.9 as it is more of a feature request than a bug and we are working on
>>> closing a large number of bugs for 0.8 before we get to issues like this.
>>> 
>>> https://issues.apache.org/jira/browse/DRILL-1650
>>> 
>>> -Jason
>>> 
>>> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> Unfortunately it seems that with larger data sets the use of flatten
>>> seems
>>>> to produce an error.
>>>> 
>>>> Any other options to filter out JSON records (objects) where an array is
>>>> empty?
>>>> 
>>>> Thanks
>>>> —Andries
>>>> 
>>>> 
>>>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> It was interesting to see flatten bypass the records with an empty
>>> array.
>>>>> 
>>>>> Unfortunately some arrays are much more complex than the example here,
>>>> and it is still useful to have the ability to filter out ones that are
>>>> empty.
>>>>> 
>>>>> —Andries
>>>>> 
>>>>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>> 
>>>>>> I figured out the differences after getting the json file from
>>> Andries.
>>>>>> 
>>>>>> My json file is like:
>>>>>> {“entities”:{xxx},“entities”:{yyy}... }
>>>>>> 
>>>>>> Andries' json file is like:
>>>>>> {“entities”:{xxx}}
>>>>>> {“entities”:{yyy}}
>>>>>> ...
>>>>>> 
>>>>>> So basically Andries' json file is contains multiple json files.
>>>>>> 
>>>>>> Fortunately, we can use the same SQL to get the same results:
>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | []         |
>>>>>> | [{"text":"GoPatriots"}] |
>>>>>> | [{"text":"aaa"}] |
>>>>>> | []         |
>>>>>> | [{"text":"bbb"}] |
>>>>>> +------------+
>>>>>> 5 rows selected (0.136 seconds)
>>>>>> 0: jdbc:drill:> with tmp as
>>>>>> . . . . . . . > (
>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>> dfs.tmp.`z2.json` t
>>>>>> . . . . . . . > )
>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | GoPatriots |
>>>>>> | aaa        |
>>>>>> | bbb        |
>>>>>> +------------+
>>>>>> 3 rows selected (0.115 seconds)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>> 
>>>>>>> In my case it returns the empty records when flatten is not used.
>>>>>>> 
>>>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
>>> hashtags
>>>>>>> from `twitter.json` t limit 10;
>>>>>>> +------------+
>>>>>>> |  hashtags  |
>>>>>>> +------------+
>>>>>>> | []         |
>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>> | []         |
>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>> +------------+
>>>>>>> 
>>>>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>> Actually not due to flatten, if you directly query the file, it will
>>>> only
>>>>>>>> show the non-null values.
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.123 seconds)
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Hao
>>>>>>>> 
>>>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
>>> if
>>>>>>> that
>>>>>>>>> is an ideal result for all use cases, but certainly usable in this
>>>> case.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> —Andries
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> I can also fetch non-null values for attached json file which
>>>> contains
>>>>>>> 6
>>>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>>>>>> 
>>>>>>>>>> Could you share your complete "twitter.json"?
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>>>>> . . . . . . . > (
>>>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>>>> dfs.tmp.`z.json` t
>>>>>>>>>> . . . . . . . > )
>>>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | GoPatriots |
>>>>>>>>>> | aaa        |
>>>>>>>>>> | bbb        |
>>>>>>>>>> +------------+
>>>>>>>>>> 3 rows selected (0.122 seconds)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Hao
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>> The sample data I posted only has 1 element, but some records have
>>>>>>>>> multiple elements in them.
>>>>>>>>>> 
>>>>>>>>>> Interestingly enough though
>>>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
>>> 10;
>>>>>>>>>> 
>>>>>>>>>> Produces
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | null       |
>>>>>>>>>> | SportsNews |
>>>>>>>>>> | null       |
>>>>>>>>>> | SportsNews |
>>>>>>>>>> | null       |
>>>>>>>>>> | null       |
>>>>>>>>>> | null       |
>>>>>>>>>> | null       |
>>>>>>>>>> | null       |
>>>>>>>>>> | CARvsSEA   |
>>>>>>>>>> +——————+
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> And
>>>>>>>>>> 
>>>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>>>>>> 
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>>>>>> +——————+
>>>>>>>>>> 
>>>>>>>>>> Strange part is that there is no indices map in the hashtags
>>> array,
>>>> so
>>>>>>>>> no idea why it shows up when pointing to the first lament in an
>>> empty
>>>>>>> array.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I just noticed that the result of "hashtags" is just an array
>>> with
>>>>>>>>> only 1
>>>>>>>>>>> element.
>>>>>>>>>>> So take your example:
>>>>>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>>>>>> {
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [],
>>>>>>>>>>> "symbols": [],
>>>>>>>>>>> "urls": [],
>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> },
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>> "hashtags": [
>>>>>>>>>>> {
>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>  "indices": []
>>>>>>>>>>> }
>>>>>>>>>>> ],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> Now we can do this to achieve the results:
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
>>> t
>>>> ;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.09 seconds)
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | GoPatriots |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.108 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hao
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> When I run the query on a larger dataset it actually show the
>>>> empty
>>>>>>>>>>>> records.
>>>>>>>>>>>> 
>>>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>>>>>> 
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | []         |
>>>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> However having the output as maps is not very useful, unless i
>>> can
>>>>>>>>> filter
>>>>>>>>>>>> out the records with empty arrays and then drill deeper into the
>>>> ones
>>>>>>>>> with
>>>>>>>>>>>> data in the arrays.
>>>>>>>>>>>> 
>>>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
>>> one
>>>>>>>>> with an
>>>>>>>>>>>> empty array as above and the other with the array data.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> —Andries
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
>>>> ignoring
>>>>>>>>> the
>>>>>>>>>>>>> null value.
>>>>>>>>>>>>> However then if we try to get "hashtags.text", it fails...which
>>>>>>>>> means it
>>>>>>>>>>>> is
>>>>>>>>>>>>> still trying to read the NULL value.
>>>>>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
>>> dfs.tmp.`d.json`
>>>> t ;
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>> be
>>>>>>>>> cast to
>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>>>>>> query.
>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Hao
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>>>>>> `/twitter.json` t
>>>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>> be
>>>>>>>>> cast to
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>> executing
>>>>>>>>> query.
>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>> },
>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The issue seems to be that if some records have arrays with
>>>> maps in
>>>>>>>>> them
>>>>>>>>>>>>>> and others are empty.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>> as t
>>>>>>>>> where
>>>>>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>> as t
>>>>>>>>> where
>>>>>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
>>>> adityakishore@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>>>>>> primitive
>>>>>>>>>>>>>>>> types. In your example, it appears from the error, the array
>>>>>>> field
>>>>>>>>>>>>>> 'member'
>>>>>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>>>>>> cmatta@mapr.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>>>>>> materialize
>>>>>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
>>>> expression:
>>>>>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>>> executing
>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> ​
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>>>>>> adityakishore@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
>>> drill?
>>>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
>>> array
>>>>>>>>> with
>>>>>>>>>>>> data
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
>>>> drill
>>>>>>>>> fails
>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
>>>> with/*
>>>>>>>>> where
>>>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
>>> url
>>>> is
>>>>>>>>> not
>>>>>>>>>>>>>>>> null.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
>>>> as
>>>>>>> an
>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>> below. Some records have an empty array with
>>> “hashtags”:[]
>>>>>>> and
>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> <z.json>
> 

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Chris 

Thanks, but I tried it before without success.

I still get this on Drill 0.7. 

select t.entities.hashtags from dfs.twitter.`/nfl` t where t.entities.hashtags[0] is not null limit 10;

Error: exception while executing query: Failure while executing query. (state=,code=0)
Query failed: Query failed: Failure while running fragment., Failure while trying to materialize incoming schema.  Errors:

Error in expression at index 0.  Error: Missing function implementation: [isnotnull(MAP-REQUIRED)].  Full expression: null.. [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
[ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]


Mine seems finicky about even using the first element in the array

select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;

Query failed: Query failed: Failure while running fragment., index: 4, length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
[ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]


However flatten seems to work. (as long as you don’t have thousands of small json files)

select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;

+------------+
|   EXPR$0   |
+------------+
| {"text":"PantherPride","indices":["12","25"]} |
| {"text":"KeepPounding","indices":["48","61"]} |
| {"text":"vegas","indices":["107","113"]} |
| {"text":"nfl","indices":["23","27"]} |
| {"text":"Cowboys","indices":["7","15"]} |
| {"text":"Cowboys","indices":["7","15"]} |
| {"text":"Ebay","indices":["52","57"]} |
| {"text":"NFL","indices":["58","62"]} |
| {"text":"Christmas","indices":["63","73"]} |
| {"text":"Gift","indices":["74","79"]} |
+------------+


—Andries




On Feb 5, 2015, at 6:41 PM, Christopher Matta <cm...@mapr.com> wrote:

> Andreas, I think I may have solved your issue:
> 
>> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE t.entities.hashtags[0].text is not null limit 10;
> +------------+
> |   EXPR$0   |
> +------------+
> | [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
> |
> | [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
> |
> | [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
> |
> | [{"text":"NotOneDime","indices":[14,25]}] |
> | [{"text":"BlackFriday","indices":[43,55]}] |
> | [{"text":"JRStudio","indices":[93,102]}] |
> | [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
> |
> | [{"text":"BlackFriday","indices":[0,12]}] |
> | [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> |
> | [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
> |
> +------------+
> 10 rows selected (1.697 seconds)
> 
> I noticed when only extracting the first text item in the hashtag array (
> t.entities.hashtags[0].text) it was returning null for those tweets that
> didn’t have hashtags, so filtering out that seems to work.
> ​
> 
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
> 
> On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <al...@gmail.com>
> wrote:
> 
>> As Aditya commented before this will work if the lists only contain scalars
>> 
>> repeated_count('entities.urls') > 0
>> 
>> If the lists contain maps unfortunately this is not available today. There
>> is an enhancement request open for this feature. I have marked it for a fix
>> in 0.9 as it is more of a feature request than a bug and we are working on
>> closing a large number of bugs for 0.8 before we get to issues like this.
>> 
>> https://issues.apache.org/jira/browse/DRILL-1650
>> 
>> -Jason
>> 
>> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
>> aengelbrecht@maprtech.com> wrote:
>> 
>>> Unfortunately it seems that with larger data sets the use of flatten
>> seems
>>> to produce an error.
>>> 
>>> Any other options to filter out JSON records (objects) where an array is
>>> empty?
>>> 
>>> Thanks
>>> —Andries
>>> 
>>> 
>>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> It was interesting to see flatten bypass the records with an empty
>> array.
>>>> 
>>>> Unfortunately some arrays are much more complex than the example here,
>>> and it is still useful to have the ability to filter out ones that are
>>> empty.
>>>> 
>>>> —Andries
>>>> 
>>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>> 
>>>>> I figured out the differences after getting the json file from
>> Andries.
>>>>> 
>>>>> My json file is like:
>>>>> {“entities”:{xxx},“entities”:{yyy}... }
>>>>> 
>>>>> Andries' json file is like:
>>>>> {“entities”:{xxx}}
>>>>> {“entities”:{yyy}}
>>>>> ...
>>>>> 
>>>>> So basically Andries' json file is contains multiple json files.
>>>>> 
>>>>> Fortunately, we can use the same SQL to get the same results:
>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | []         |
>>>>> | [{"text":"GoPatriots"}] |
>>>>> | [{"text":"aaa"}] |
>>>>> | []         |
>>>>> | [{"text":"bbb"}] |
>>>>> +------------+
>>>>> 5 rows selected (0.136 seconds)
>>>>> 0: jdbc:drill:> with tmp as
>>>>> . . . . . . . > (
>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>> dfs.tmp.`z2.json` t
>>>>> . . . . . . . > )
>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | GoPatriots |
>>>>> | aaa        |
>>>>> | bbb        |
>>>>> +------------+
>>>>> 3 rows selected (0.115 seconds)
>>>>> 
>>>>> Thanks,
>>>>> Hao
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>>>>> aengelbrecht@maprtech.com> wrote:
>>>>> 
>>>>>> In my case it returns the empty records when flatten is not used.
>>>>>> 
>>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
>> hashtags
>>>>>> from `twitter.json` t limit 10;
>>>>>> +------------+
>>>>>> |  hashtags  |
>>>>>> +------------+
>>>>>> | []         |
>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>> | []         |
>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>> | []         |
>>>>>> | []         |
>>>>>> | []         |
>>>>>> | []         |
>>>>>> | []         |
>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>> +------------+
>>>>>> 
>>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>> 
>>>>>>> Actually not due to flatten, if you directly query the file, it will
>>> only
>>>>>>> show the non-null values.
>>>>>>> 
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.123 seconds)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
>> if
>>>>>> that
>>>>>>>> is an ideal result for all use cases, but certainly usable in this
>>> case.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> —Andries
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>>> I can also fetch non-null values for attached json file which
>>> contains
>>>>>> 6
>>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>>>>> 
>>>>>>>>> Could you share your complete "twitter.json"?
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>>>> . . . . . . . > (
>>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>>> dfs.tmp.`z.json` t
>>>>>>>>> . . . . . . . > )
>>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | GoPatriots |
>>>>>>>>> | aaa        |
>>>>>>>>> | bbb        |
>>>>>>>>> +------------+
>>>>>>>>> 3 rows selected (0.122 seconds)
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hao
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>> The sample data I posted only has 1 element, but some records have
>>>>>>>> multiple elements in them.
>>>>>>>>> 
>>>>>>>>> Interestingly enough though
>>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
>> 10;
>>>>>>>>> 
>>>>>>>>> Produces
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | null       |
>>>>>>>>> | SportsNews |
>>>>>>>>> | null       |
>>>>>>>>> | SportsNews |
>>>>>>>>> | null       |
>>>>>>>>> | null       |
>>>>>>>>> | null       |
>>>>>>>>> | null       |
>>>>>>>>> | null       |
>>>>>>>>> | CARvsSEA   |
>>>>>>>>> +——————+
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> And
>>>>>>>>> 
>>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>>>>> 
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"indices":[]} |
>>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>>>>> +——————+
>>>>>>>>> 
>>>>>>>>> Strange part is that there is no indices map in the hashtags
>> array,
>>> so
>>>>>>>> no idea why it shows up when pointing to the first lament in an
>> empty
>>>>>> array.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> —Andries
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>> 
>>>>>>>>>> I just noticed that the result of "hashtags" is just an array
>> with
>>>>>>>> only 1
>>>>>>>>>> element.
>>>>>>>>>> So take your example:
>>>>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>>>>> {
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [],
>>>>>>>>>> "symbols": [],
>>>>>>>>>> "urls": [],
>>>>>>>>>> "hashtags": [],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> },
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>> "hashtags": [
>>>>>>>>>> {
>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>   "indices": []
>>>>>>>>>> }
>>>>>>>>>> ],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> Now we can do this to achieve the results:
>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
>> t
>>> ;
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.09 seconds)
>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | GoPatriots |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.108 seconds)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Hao
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> When I run the query on a larger dataset it actually show the
>>> empty
>>>>>>>>>>> records.
>>>>>>>>>>> 
>>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>>>>> 
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | []         |
>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | []         |
>>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> However having the output as maps is not very useful, unless i
>> can
>>>>>>>> filter
>>>>>>>>>>> out the records with empty arrays and then drill deeper into the
>>> ones
>>>>>>>> with
>>>>>>>>>>> data in the arrays.
>>>>>>>>>>> 
>>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
>> one
>>>>>>>> with an
>>>>>>>>>>> empty array as above and the other with the array data.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
>>> ignoring
>>>>>>>> the
>>>>>>>>>>>> null value.
>>>>>>>>>>>> However then if we try to get "hashtags.text", it fails...which
>>>>>>>> means it
>>>>>>>>>>> is
>>>>>>>>>>>> still trying to read the NULL value.
>>>>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
>> dfs.tmp.`d.json`
>>> t ;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>> 
>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>> be
>>>>>>>> cast to
>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>>>>> query.
>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hao
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>>>>> `/twitter.json` t
>>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>> be
>>>>>>>> cast to
>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error: exception while executing query: Failure while
>> executing
>>>>>>>> query.
>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> {
>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>> },
>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>> {
>>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>>  "indices": []
>>>>>>>>>>>>> }
>>>>>>>>>>>>> ],
>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>> }
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The issue seems to be that if some records have arrays with
>>> maps in
>>>>>>>> them
>>>>>>>>>>>>> and others are empty.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>>>  "indices": [
>>>>>>>>>>>>>>    83,
>>>>>>>>>>>>>>    94
>>>>>>>>>>>>>>  ]
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>> },
>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>>>  "indices": []
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>> as t
>>>>>>>> where
>>>>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>> as t
>>>>>>>> where
>>>>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
>>> adityakishore@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>>>>> primitive
>>>>>>>>>>>>>>> types. In your example, it appears from the error, the array
>>>>>> field
>>>>>>>>>>>>> 'member'
>>>>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>>>>> cmatta@mapr.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>>>>> materialize
>>>>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
>>> expression:
>>>>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>> executing
>>>>>>>>>>> query.
>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ​
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>>>>> adityakishore@gmail.com
>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
>> drill?
>>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
>> array
>>>>>>>> with
>>>>>>>>>>> data
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
>>> drill
>>>>>>>> fails
>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
>>> with/*
>>>>>>>> where
>>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
>> url
>>> is
>>>>>>>> not
>>>>>>>>>>>>>>> null.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
>>> as
>>>>>> an
>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>> below. Some records have an empty array with
>> “hashtags”:[]
>>>>>> and
>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>>>>>>>  "indices": [
>>>>>>>>>>>>>>>>>>    83,
>>>>>>>>>>>>>>>>>>    94
>>>>>>>>>>>>>>>>>>  ]
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> <z.json>
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Christopher Matta <cm...@mapr.com>.
Andreas, I think I may have solved your issue:

> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE t.entities.hashtags[0].text is not null limit 10;
+------------+
|   EXPR$0   |
+------------+
| [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
|
| [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
|
| [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
|
| [{"text":"NotOneDime","indices":[14,25]}] |
| [{"text":"BlackFriday","indices":[43,55]}] |
| [{"text":"JRStudio","indices":[93,102]}] |
| [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
|
| [{"text":"BlackFriday","indices":[0,12]}] |
| [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
|
| [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
|
+------------+
10 rows selected (1.697 seconds)

I noticed when only extracting the first text item in the hashtag array (
t.entities.hashtags[0].text) it was returning null for those tweets that
didn’t have hashtags, so filtering out that seems to work.
​

Chris Matta
cmatta@mapr.com
215-701-3146

On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <al...@gmail.com>
wrote:

> As Aditya commented before this will work if the lists only contain scalars
>
> repeated_count('entities.urls') > 0
>
> If the lists contain maps unfortunately this is not available today. There
> is an enhancement request open for this feature. I have marked it for a fix
> in 0.9 as it is more of a feature request than a bug and we are working on
> closing a large number of bugs for 0.8 before we get to issues like this.
>
> https://issues.apache.org/jira/browse/DRILL-1650
>
> -Jason
>
> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > Unfortunately it seems that with larger data sets the use of flatten
> seems
> > to produce an error.
> >
> > Any other options to filter out JSON records (objects) where an array is
> > empty?
> >
> > Thanks
> > —Andries
> >
> >
> > On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > It was interesting to see flatten bypass the records with an empty
> array.
> > >
> > > Unfortunately some arrays are much more complex than the example here,
> > and it is still useful to have the ability to filter out ones that are
> > empty.
> > >
> > > —Andries
> > >
> > > On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >
> > >> I figured out the differences after getting the json file from
> Andries.
> > >>
> > >> My json file is like:
> > >> {“entities”:{xxx},“entities”:{yyy}... }
> > >>
> > >> Andries' json file is like:
> > >> {“entities”:{xxx}}
> > >> {“entities”:{yyy}}
> > >> ...
> > >>
> > >> So basically Andries' json file is contains multiple json files.
> > >>
> > >> Fortunately, we can use the same SQL to get the same results:
> > >> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> > >> +------------+
> > >> |   EXPR$0   |
> > >> +------------+
> > >> | []         |
> > >> | [{"text":"GoPatriots"}] |
> > >> | [{"text":"aaa"}] |
> > >> | []         |
> > >> | [{"text":"bbb"}] |
> > >> +------------+
> > >> 5 rows selected (0.136 seconds)
> > >> 0: jdbc:drill:> with tmp as
> > >> . . . . . . . > (
> > >> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > >> dfs.tmp.`z2.json` t
> > >> . . . . . . . > )
> > >> . . . . . . . > select tmp.c.text from tmp;
> > >> +------------+
> > >> |   EXPR$0   |
> > >> +------------+
> > >> | GoPatriots |
> > >> | aaa        |
> > >> | bbb        |
> > >> +------------+
> > >> 3 rows selected (0.115 seconds)
> > >>
> > >> Thanks,
> > >> Hao
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> > >> aengelbrecht@maprtech.com> wrote:
> > >>
> > >>> In my case it returns the empty records when flatten is not used.
> > >>>
> > >>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
> hashtags
> > >>> from `twitter.json` t limit 10;
> > >>> +------------+
> > >>> |  hashtags  |
> > >>> +------------+
> > >>> | []         |
> > >>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>> | []         |
> > >>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>> | []         |
> > >>> | []         |
> > >>> | []         |
> > >>> | []         |
> > >>> | []         |
> > >>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > >>> +------------+
> > >>>
> > >>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>
> > >>>> Actually not due to flatten, if you directly query the file, it will
> > only
> > >>>> show the non-null values.
> > >>>>
> > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > >>>> +------------+
> > >>>> |   EXPR$0   |
> > >>>> +------------+
> > >>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > >>>> +------------+
> > >>>> 1 row selected (0.123 seconds)
> > >>>>
> > >>>> Thanks,
> > >>>> Hao
> > >>>>
> > >>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > >>>> aengelbrecht@maprtech.com> wrote:
> > >>>>
> > >>>>> Very interesting, flatten seems to bypass empty records. Not sure
> if
> > >>> that
> > >>>>> is an ideal result for all use cases, but certainly usable in this
> > case.
> > >>>>>
> > >>>>> Thanks
> > >>>>> —Andries
> > >>>>>
> > >>>>>
> > >>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>>
> > >>>>>> I can also fetch non-null values for attached json file which
> > contains
> > >>> 6
> > >>>>> "entities", 3 of them are null, 3 of them have "text" value.
> > >>>>>>
> > >>>>>> Could you share your complete "twitter.json"?
> > >>>>>>
> > >>>>>> 0: jdbc:drill:> with tmp as
> > >>>>>> . . . . . . . > (
> > >>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> > >>>>> dfs.tmp.`z.json` t
> > >>>>>> . . . . . . . > )
> > >>>>>> . . . . . . . > select tmp.c.text from tmp;
> > >>>>>> +------------+
> > >>>>>> |   EXPR$0   |
> > >>>>>> +------------+
> > >>>>>> | GoPatriots |
> > >>>>>> | aaa        |
> > >>>>>> | bbb        |
> > >>>>>> +------------+
> > >>>>>> 3 rows selected (0.122 seconds)
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Hao
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> > >>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>> The sample data I posted only has 1 element, but some records have
> > >>>>> multiple elements in them.
> > >>>>>>
> > >>>>>> Interestingly enough though
> > >>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
> 10;
> > >>>>>>
> > >>>>>> Produces
> > >>>>>> +------------+
> > >>>>>> |   EXPR$0   |
> > >>>>>> +------------+
> > >>>>>> | null       |
> > >>>>>> | SportsNews |
> > >>>>>> | null       |
> > >>>>>> | SportsNews |
> > >>>>>> | null       |
> > >>>>>> | null       |
> > >>>>>> | null       |
> > >>>>>> | null       |
> > >>>>>> | null       |
> > >>>>>> | CARvsSEA   |
> > >>>>>> +——————+
> > >>>>>>
> > >>>>>>
> > >>>>>> And
> > >>>>>>
> > >>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> > >>>>>>
> > >>>>>> +------------+
> > >>>>>> |   EXPR$0   |
> > >>>>>> +------------+
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"indices":[]} |
> > >>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
> > >>>>>> +——————+
> > >>>>>>
> > >>>>>> Strange part is that there is no indices map in the hashtags
> array,
> > so
> > >>>>> no idea why it shows up when pointing to the first lament in an
> empty
> > >>> array.
> > >>>>>>
> > >>>>>>
> > >>>>>> —Andries
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>>>
> > >>>>>>> I just noticed that the result of "hashtags" is just an array
> with
> > >>>>> only 1
> > >>>>>>> element.
> > >>>>>>> So take your example:
> > >>>>>>> [root@maprdemo tmp]# cat d.json
> > >>>>>>> {
> > >>>>>>> "entities": {
> > >>>>>>> "trends": [],
> > >>>>>>> "symbols": [],
> > >>>>>>> "urls": [],
> > >>>>>>> "hashtags": [],
> > >>>>>>> "user_mentions": []
> > >>>>>>> },
> > >>>>>>> "entities": {
> > >>>>>>> "trends": [1,2,3],
> > >>>>>>> "symbols": [4,5,6],
> > >>>>>>> "urls": [7,8,9],
> > >>>>>>> "hashtags": [
> > >>>>>>>  {
> > >>>>>>>    "text": "GoPatriots",
> > >>>>>>>    "indices": []
> > >>>>>>>  }
> > >>>>>>> ],
> > >>>>>>> "user_mentions": []
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Now we can do this to achieve the results:
> > >>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> t
> > ;
> > >>>>>>> +------------+
> > >>>>>>> |   EXPR$0   |
> > >>>>>>> +------------+
> > >>>>>>> | [{"text":"GoPatriots"}] |
> > >>>>>>> +------------+
> > >>>>>>> 1 row selected (0.09 seconds)
> > >>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> > >>>>> dfs.tmp.`d.json` t ;
> > >>>>>>> +------------+
> > >>>>>>> |   EXPR$0   |
> > >>>>>>> +------------+
> > >>>>>>> | GoPatriots |
> > >>>>>>> +------------+
> > >>>>>>> 1 row selected (0.108 seconds)
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Hao
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > >>>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>>
> > >>>>>>>> When I run the query on a larger dataset it actually show the
> > empty
> > >>>>>>>> records.
> > >>>>>>>>
> > >>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> > >>>>>>>>
> > >>>>>>>> +------------+
> > >>>>>>>> |   EXPR$0   |
> > >>>>>>>> +------------+
> > >>>>>>>> | []         |
> > >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>>>>>>> | []         |
> > >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> > >>>>>>>> | []         |
> > >>>>>>>> | []         |
> > >>>>>>>> | []         |
> > >>>>>>>> | []         |
> > >>>>>>>> | []         |
> > >>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > >>>>>>>> +------------+
> > >>>>>>>> 10 rows selected (2.899 seconds)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> However having the output as maps is not very useful, unless i
> can
> > >>>>> filter
> > >>>>>>>> out the records with empty arrays and then drill deeper into the
> > ones
> > >>>>> with
> > >>>>>>>> data in the arrays.
> > >>>>>>>>
> > >>>>>>>> BTW: Hao I would have expiated your query to return both rows,
> one
> > >>>>> with an
> > >>>>>>>> empty array as above and the other with the array data.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> —Andries
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> I am not sure if below is expected behavior.
> > >>>>>>>>> If we only select "hashtags", and it will return only 1 row
> > ignoring
> > >>>>> the
> > >>>>>>>>> null value.
> > >>>>>>>>> However then if we try to get "hashtags.text", it fails...which
> > >>>>> means it
> > >>>>>>>> is
> > >>>>>>>>> still trying to read the NULL value.
> > >>>>>>>>> I am thinking it may confuse the SQL developers.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
> dfs.tmp.`d.json`
> > t ;
> > >>>>>>>>> +------------+
> > >>>>>>>>> |   EXPR$0   |
> > >>>>>>>>> +------------+
> > >>>>>>>>> | [{"text":"GoPatriots"}] |
> > >>>>>>>>> +------------+
> > >>>>>>>>> 1 row selected (0.109 seconds)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> > >>>>> dfs.tmp.`d.json` t ;
> > >>>>>>>>>
> > >>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> > >>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> be
> > >>>>> cast to
> > >>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > >>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Error: exception while executing query: Failure while executing
> > >>>>> query.
> > >>>>>>>>> (state=,code=0)
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Hao
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > >>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Now try on hashtags with the following:
> > >>>>>>>>>>
> > >>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> > >>>>> `/twitter.json` t
> > >>>>>>>>>> where t.entities.hashtags is not null limit 10;
> > >>>>>>>>>>
> > >>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> > >>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
> be
> > >>>>> cast to
> > >>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> > >>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Error: exception while executing query: Failure while
> executing
> > >>>>> query.
> > >>>>>>>>>> (state=,code=0)
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> {
> > >>>>>>>>>> "entities": {
> > >>>>>>>>>> "trends": [],
> > >>>>>>>>>> "symbols": [],
> > >>>>>>>>>> "urls": [],
> > >>>>>>>>>> "hashtags": [],
> > >>>>>>>>>> "user_mentions": []
> > >>>>>>>>>> },
> > >>>>>>>>>> "entities": {
> > >>>>>>>>>> "trends": [1,2,3],
> > >>>>>>>>>> "symbols": [4,5,6],
> > >>>>>>>>>> "urls": [7,8,9],
> > >>>>>>>>>> "hashtags": [
> > >>>>>>>>>> {
> > >>>>>>>>>>   "text": "GoPatriots",
> > >>>>>>>>>>   "indices": []
> > >>>>>>>>>> }
> > >>>>>>>>>> ],
> > >>>>>>>>>> "user_mentions": []
> > >>>>>>>>>> }
> > >>>>>>>>>> }
> > >>>>>>>>>>
> > >>>>>>>>>> The issue seems to be that if some records have arrays with
> > maps in
> > >>>>> them
> > >>>>>>>>>> and others are empty.
> > >>>>>>>>>>
> > >>>>>>>>>> —Andries
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com>
> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Seems it works for below json file:
> > >>>>>>>>>>> {
> > >>>>>>>>>>> "entities": {
> > >>>>>>>>>>> "trends": [],
> > >>>>>>>>>>> "symbols": [],
> > >>>>>>>>>>> "urls": [],
> > >>>>>>>>>>> "hashtags": [
> > >>>>>>>>>>> {
> > >>>>>>>>>>>   "text": "GoPatriots",
> > >>>>>>>>>>>   "indices": [
> > >>>>>>>>>>>     83,
> > >>>>>>>>>>>     94
> > >>>>>>>>>>>   ]
> > >>>>>>>>>>> }
> > >>>>>>>>>>> ],
> > >>>>>>>>>>> "user_mentions": []
> > >>>>>>>>>>> },
> > >>>>>>>>>>> "entities": {
> > >>>>>>>>>>> "trends": [1,2,3],
> > >>>>>>>>>>> "symbols": [4,5,6],
> > >>>>>>>>>>> "urls": [7,8,9],
> > >>>>>>>>>>> "hashtags": [
> > >>>>>>>>>>> {
> > >>>>>>>>>>>   "text": "GoPatriots",
> > >>>>>>>>>>>   "indices": []
> > >>>>>>>>>>> }
> > >>>>>>>>>>> ],
> > >>>>>>>>>>> "user_mentions": []
> > >>>>>>>>>>> }
> > >>>>>>>>>>> }
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> > as t
> > >>>>> where
> > >>>>>>>>>>> t.entities.urls is not null;
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> |   EXPR$0   |
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> | [7,8,9]    |
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> 1 row selected (0.139 seconds)
> > >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> > as t
> > >>>>> where
> > >>>>>>>>>>> t.entities.urls is null;
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> |   EXPR$0   |
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> +------------+
> > >>>>>>>>>>> No rows selected (0.158 seconds)
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Hao
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> > adityakishore@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I believe that this works if the array contains homogeneous
> > >>>>> primitive
> > >>>>>>>>>>>> types. In your example, it appears from the error, the array
> > >>> field
> > >>>>>>>>>> 'member'
> > >>>>>>>>>>>> contained maps for at least one record.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> > >>>>> cmatta@mapr.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> > >>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> > >>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
> > >>> materialize
> > >>>>>>>>>>>> incoming schema.  Errors:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
> > >>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> > expression:
> > >>>>>>>>>> --UNKNOWN
> > >>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > >>>>>>>>>> bullseye-3:31010 ]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Error: exception while executing query: Failure while
> > executing
> > >>>>>>>> query.
> > >>>>>>>>>>>> (state=,code=0)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> ​
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Chris Matta
> > >>>>>>>>>>>>> cmatta@mapr.com
> > >>>>>>>>>>>>> 215-701-3146
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> > >>> adityakishore@gmail.com
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> repeated_count('entities.urls') > 0
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > >>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> How do you filter out records with an empty array in
> drill?
> > >>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
> array
> > >>>>> with
> > >>>>>>>> data
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>>>> it. When trying to read records with data in the array
> > drill
> > >>>>> fails
> > >>>>>>>>>> due
> > >>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
> > with/*
> > >>>>> where
> > >>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
> url
> > is
> > >>>>> not
> > >>>>>>>>>>>> null.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
> > as
> > >>> an
> > >>>>>>>>>>>> example
> > >>>>>>>>>>>>>>> below. Some records have an empty array with
> “hashtags”:[]
> > >>> and
> > >>>>>>>>>> others
> > >>>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>> look similar to what is listed below.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> "entities": {
> > >>>>>>>>>>>>>>> "trends": [],
> > >>>>>>>>>>>>>>> "symbols": [],
> > >>>>>>>>>>>>>>> "urls": [],
> > >>>>>>>>>>>>>>> "hashtags": [
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>   "text": "GoPatriots",
> > >>>>>>>>>>>>>>>   "indices": [
> > >>>>>>>>>>>>>>>     83,
> > >>>>>>>>>>>>>>>     94
> > >>>>>>>>>>>>>>>   ]
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>> ],
> > >>>>>>>>>>>>>>> "user_mentions": []
> > >>>>>>>>>>>>>>> },
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>> —Andries
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> <z.json>
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >
> >
> >
>

Re: Filter out empty arrays in JSON

Posted by Jason Altekruse <al...@gmail.com>.
As Aditya commented before this will work if the lists only contain scalars

repeated_count('entities.urls') > 0

If the lists contain maps unfortunately this is not available today. There
is an enhancement request open for this feature. I have marked it for a fix
in 0.9 as it is more of a feature request than a bug and we are working on
closing a large number of bugs for 0.8 before we get to issues like this.

https://issues.apache.org/jira/browse/DRILL-1650

-Jason

On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Unfortunately it seems that with larger data sets the use of flatten seems
> to produce an error.
>
> Any other options to filter out JSON records (objects) where an array is
> empty?
>
> Thanks
> —Andries
>
>
> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > It was interesting to see flatten bypass the records with an empty array.
> >
> > Unfortunately some arrays are much more complex than the example here,
> and it is still useful to have the ability to filter out ones that are
> empty.
> >
> > —Andries
> >
> > On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >
> >> I figured out the differences after getting the json file from Andries.
> >>
> >> My json file is like:
> >> {“entities”:{xxx},“entities”:{yyy}... }
> >>
> >> Andries' json file is like:
> >> {“entities”:{xxx}}
> >> {“entities”:{yyy}}
> >> ...
> >>
> >> So basically Andries' json file is contains multiple json files.
> >>
> >> Fortunately, we can use the same SQL to get the same results:
> >> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | []         |
> >> | [{"text":"GoPatriots"}] |
> >> | [{"text":"aaa"}] |
> >> | []         |
> >> | [{"text":"bbb"}] |
> >> +------------+
> >> 5 rows selected (0.136 seconds)
> >> 0: jdbc:drill:> with tmp as
> >> . . . . . . . > (
> >> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >> dfs.tmp.`z2.json` t
> >> . . . . . . . > )
> >> . . . . . . . > select tmp.c.text from tmp;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | GoPatriots |
> >> | aaa        |
> >> | bbb        |
> >> +------------+
> >> 3 rows selected (0.115 seconds)
> >>
> >> Thanks,
> >> Hao
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> >> aengelbrecht@maprtech.com> wrote:
> >>
> >>> In my case it returns the empty records when flatten is not used.
> >>>
> >>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> >>> from `twitter.json` t limit 10;
> >>> +------------+
> >>> |  hashtags  |
> >>> +------------+
> >>> | []         |
> >>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>> | []         |
> >>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>> +------------+
> >>>
> >>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>
> >>>> Actually not due to flatten, if you directly query the file, it will
> only
> >>>> show the non-null values.
> >>>>
> >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> >>>> +------------+
> >>>> 1 row selected (0.123 seconds)
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> >>>> aengelbrecht@maprtech.com> wrote:
> >>>>
> >>>>> Very interesting, flatten seems to bypass empty records. Not sure if
> >>> that
> >>>>> is an ideal result for all use cases, but certainly usable in this
> case.
> >>>>>
> >>>>> Thanks
> >>>>> —Andries
> >>>>>
> >>>>>
> >>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>
> >>>>>> I can also fetch non-null values for attached json file which
> contains
> >>> 6
> >>>>> "entities", 3 of them are null, 3 of them have "text" value.
> >>>>>>
> >>>>>> Could you share your complete "twitter.json"?
> >>>>>>
> >>>>>> 0: jdbc:drill:> with tmp as
> >>>>>> . . . . . . . > (
> >>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>> dfs.tmp.`z.json` t
> >>>>>> . . . . . . . > )
> >>>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | GoPatriots |
> >>>>>> | aaa        |
> >>>>>> | bbb        |
> >>>>>> +------------+
> >>>>>> 3 rows selected (0.122 seconds)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>> The sample data I posted only has 1 element, but some records have
> >>>>> multiple elements in them.
> >>>>>>
> >>>>>> Interestingly enough though
> >>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> >>>>>>
> >>>>>> Produces
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | null       |
> >>>>>> | SportsNews |
> >>>>>> | null       |
> >>>>>> | SportsNews |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | CARvsSEA   |
> >>>>>> +——————+
> >>>>>>
> >>>>>>
> >>>>>> And
> >>>>>>
> >>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>>>>
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>>>>> +——————+
> >>>>>>
> >>>>>> Strange part is that there is no indices map in the hashtags array,
> so
> >>>>> no idea why it shows up when pointing to the first lament in an empty
> >>> array.
> >>>>>>
> >>>>>>
> >>>>>> —Andries
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>
> >>>>>>> I just noticed that the result of "hashtags" is just an array with
> >>>>> only 1
> >>>>>>> element.
> >>>>>>> So take your example:
> >>>>>>> [root@maprdemo tmp]# cat d.json
> >>>>>>> {
> >>>>>>> "entities": {
> >>>>>>> "trends": [],
> >>>>>>> "symbols": [],
> >>>>>>> "urls": [],
> >>>>>>> "hashtags": [],
> >>>>>>> "user_mentions": []
> >>>>>>> },
> >>>>>>> "entities": {
> >>>>>>> "trends": [1,2,3],
> >>>>>>> "symbols": [4,5,6],
> >>>>>>> "urls": [7,8,9],
> >>>>>>> "hashtags": [
> >>>>>>>  {
> >>>>>>>    "text": "GoPatriots",
> >>>>>>>    "indices": []
> >>>>>>>  }
> >>>>>>> ],
> >>>>>>> "user_mentions": []
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Now we can do this to achieve the results:
> >>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t
> ;
> >>>>>>> +------------+
> >>>>>>> |   EXPR$0   |
> >>>>>>> +------------+
> >>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>> +------------+
> >>>>>>> 1 row selected (0.09 seconds)
> >>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >>>>> dfs.tmp.`d.json` t ;
> >>>>>>> +------------+
> >>>>>>> |   EXPR$0   |
> >>>>>>> +------------+
> >>>>>>> | GoPatriots |
> >>>>>>> +------------+
> >>>>>>> 1 row selected (0.108 seconds)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Hao
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> When I run the query on a larger dataset it actually show the
> empty
> >>>>>>>> records.
> >>>>>>>>
> >>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>>>>
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>>>> +------------+
> >>>>>>>> 10 rows selected (2.899 seconds)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> However having the output as maps is not very useful, unless i can
> >>>>> filter
> >>>>>>>> out the records with empty arrays and then drill deeper into the
> ones
> >>>>> with
> >>>>>>>> data in the arrays.
> >>>>>>>>
> >>>>>>>> BTW: Hao I would have expiated your query to return both rows, one
> >>>>> with an
> >>>>>>>> empty array as above and the other with the array data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> —Andries
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> I am not sure if below is expected behavior.
> >>>>>>>>> If we only select "hashtags", and it will return only 1 row
> ignoring
> >>>>> the
> >>>>>>>>> null value.
> >>>>>>>>> However then if we try to get "hashtags.text", it fails...which
> >>>>> means it
> >>>>>>>> is
> >>>>>>>>> still trying to read the NULL value.
> >>>>>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> t ;
> >>>>>>>>> +------------+
> >>>>>>>>> |   EXPR$0   |
> >>>>>>>>> +------------+
> >>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>> +------------+
> >>>>>>>>> 1 row selected (0.109 seconds)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>
> >>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >>>>> cast to
> >>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>> (state=,code=0)
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Hao
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Now try on hashtags with the following:
> >>>>>>>>>>
> >>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >>>>> `/twitter.json` t
> >>>>>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>>>>
> >>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >>>>> cast to
> >>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>>> (state=,code=0)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> {
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [],
> >>>>>>>>>> "symbols": [],
> >>>>>>>>>> "urls": [],
> >>>>>>>>>> "hashtags": [],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> },
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>> "hashtags": [
> >>>>>>>>>> {
> >>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>   "indices": []
> >>>>>>>>>> }
> >>>>>>>>>> ],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> The issue seems to be that if some records have arrays with
> maps in
> >>>>> them
> >>>>>>>>>> and others are empty.
> >>>>>>>>>>
> >>>>>>>>>> —Andries
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Seems it works for below json file:
> >>>>>>>>>>> {
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [],
> >>>>>>>>>>> "symbols": [],
> >>>>>>>>>>> "urls": [],
> >>>>>>>>>>> "hashtags": [
> >>>>>>>>>>> {
> >>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>   "indices": [
> >>>>>>>>>>>     83,
> >>>>>>>>>>>     94
> >>>>>>>>>>>   ]
> >>>>>>>>>>> }
> >>>>>>>>>>> ],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> },
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>> "hashtags": [
> >>>>>>>>>>> {
> >>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>   "indices": []
> >>>>>>>>>>> }
> >>>>>>>>>>> ],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as t
> >>>>> where
> >>>>>>>>>>> t.entities.urls is not null;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> | [7,8,9]    |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as t
> >>>>> where
> >>>>>>>>>>> t.entities.urls is null;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Hao
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> adityakishore@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I believe that this works if the array contains homogeneous
> >>>>> primitive
> >>>>>>>>>>>> types. In your example, it appears from the error, the array
> >>> field
> >>>>>>>>>> 'member'
> >>>>>>>>>>>> contained maps for at least one record.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >>>>> cmatta@mapr.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
> >>> materialize
> >>>>>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> expression:
> >>>>>>>>>> --UNKNOWN
> >>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error: exception while executing query: Failure while
> executing
> >>>>>>>> query.
> >>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ​
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Chris Matta
> >>>>>>>>>>>>> cmatta@mapr.com
> >>>>>>>>>>>>> 215-701-3146
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> >>> adityakishore@gmail.com
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
> >>>>> with
> >>>>>>>> data
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>> it. When trying to read records with data in the array
> drill
> >>>>> fails
> >>>>>>>>>> due
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
> with/*
> >>>>> where
> >>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url
> is
> >>>>> not
> >>>>>>>>>>>> null.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
> as
> >>> an
> >>>>>>>>>>>> example
> >>>>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> >>> and
> >>>>>>>>>> others
> >>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>>>>>   "indices": [
> >>>>>>>>>>>>>>>     83,
> >>>>>>>>>>>>>>>     94
> >>>>>>>>>>>>>>>   ]
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>> —Andries
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> <z.json>
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Unfortunately it seems that with larger data sets the use of flatten seems to produce an error.

Any other options to filter out JSON records (objects) where an array is empty?

Thanks
—Andries


On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:

> It was interesting to see flatten bypass the records with an empty array.
> 
> Unfortunately some arrays are much more complex than the example here, and it is still useful to have the ability to filter out ones that are empty.
> 
> —Andries
> 
> On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:
> 
>> I figured out the differences after getting the json file from Andries.
>> 
>> My json file is like:
>> {“entities”:{xxx},“entities”:{yyy}... }
>> 
>> Andries' json file is like:
>> {“entities”:{xxx}}
>> {“entities”:{yyy}}
>> ...
>> 
>> So basically Andries' json file is contains multiple json files.
>> 
>> Fortunately, we can use the same SQL to get the same results:
>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | []         |
>> | [{"text":"GoPatriots"}] |
>> | [{"text":"aaa"}] |
>> | []         |
>> | [{"text":"bbb"}] |
>> +------------+
>> 5 rows selected (0.136 seconds)
>> 0: jdbc:drill:> with tmp as
>> . . . . . . . > (
>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>> dfs.tmp.`z2.json` t
>> . . . . . . . > )
>> . . . . . . . > select tmp.c.text from tmp;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | GoPatriots |
>> | aaa        |
>> | bbb        |
>> +------------+
>> 3 rows selected (0.115 seconds)
>> 
>> Thanks,
>> Hao
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>> aengelbrecht@maprtech.com> wrote:
>> 
>>> In my case it returns the empty records when flatten is not used.
>>> 
>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
>>> from `twitter.json` t limit 10;
>>> +------------+
>>> |  hashtags  |
>>> +------------+
>>> | []         |
>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>> | []         |
>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>> +------------+
>>> 
>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>> 
>>>> Actually not due to flatten, if you directly query the file, it will only
>>>> show the non-null values.
>>>> 
>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>> +------------+
>>>> 1 row selected (0.123 seconds)
>>>> 
>>>> Thanks,
>>>> Hao
>>>> 
>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> Very interesting, flatten seems to bypass empty records. Not sure if
>>> that
>>>>> is an ideal result for all use cases, but certainly usable in this case.
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>> 
>>>>>> I can also fetch non-null values for attached json file which contains
>>> 6
>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>> 
>>>>>> Could you share your complete "twitter.json"?
>>>>>> 
>>>>>> 0: jdbc:drill:> with tmp as
>>>>>> . . . . . . . > (
>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>> dfs.tmp.`z.json` t
>>>>>> . . . . . . . > )
>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | GoPatriots |
>>>>>> | aaa        |
>>>>>> | bbb        |
>>>>>> +------------+
>>>>>> 3 rows selected (0.122 seconds)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>> The sample data I posted only has 1 element, but some records have
>>>>> multiple elements in them.
>>>>>> 
>>>>>> Interestingly enough though
>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>>>>>> 
>>>>>> Produces
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | null       |
>>>>>> | SportsNews |
>>>>>> | null       |
>>>>>> | SportsNews |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | CARvsSEA   |
>>>>>> +——————+
>>>>>> 
>>>>>> 
>>>>>> And
>>>>>> 
>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>> 
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>> +——————+
>>>>>> 
>>>>>> Strange part is that there is no indices map in the hashtags array, so
>>>>> no idea why it shows up when pointing to the first lament in an empty
>>> array.
>>>>>> 
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>> 
>>>>>>> I just noticed that the result of "hashtags" is just an array with
>>>>> only 1
>>>>>>> element.
>>>>>>> So take your example:
>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>> {
>>>>>>> "entities": {
>>>>>>> "trends": [],
>>>>>>> "symbols": [],
>>>>>>> "urls": [],
>>>>>>> "hashtags": [],
>>>>>>> "user_mentions": []
>>>>>>> },
>>>>>>> "entities": {
>>>>>>> "trends": [1,2,3],
>>>>>>> "symbols": [4,5,6],
>>>>>>> "urls": [7,8,9],
>>>>>>> "hashtags": [
>>>>>>>  {
>>>>>>>    "text": "GoPatriots",
>>>>>>>    "indices": []
>>>>>>>  }
>>>>>>> ],
>>>>>>> "user_mentions": []
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> Now we can do this to achieve the results:
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.09 seconds)
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>> dfs.tmp.`d.json` t ;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | GoPatriots |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.108 seconds)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> When I run the query on a larger dataset it actually show the empty
>>>>>>>> records.
>>>>>>>> 
>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>> 
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>> +------------+
>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> However having the output as maps is not very useful, unless i can
>>>>> filter
>>>>>>>> out the records with empty arrays and then drill deeper into the ones
>>>>> with
>>>>>>>> data in the arrays.
>>>>>>>> 
>>>>>>>> BTW: Hao I would have expiated your query to return both rows, one
>>>>> with an
>>>>>>>> empty array as above and the other with the array data.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> —Andries
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>> If we only select "hashtags", and it will return only 1 row ignoring
>>>>> the
>>>>>>>>> null value.
>>>>>>>>> However then if we try to get "hashtags.text", it fails...which
>>>>> means it
>>>>>>>> is
>>>>>>>>> still trying to read the NULL value.
>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>> 
>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>>> cast to
>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>> (state=,code=0)
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hao
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>> 
>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>> `/twitter.json` t
>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>> 
>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>>> cast to
>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>>> (state=,code=0)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> {
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [],
>>>>>>>>>> "symbols": [],
>>>>>>>>>> "urls": [],
>>>>>>>>>> "hashtags": [],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> },
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>> "hashtags": [
>>>>>>>>>> {
>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>   "indices": []
>>>>>>>>>> }
>>>>>>>>>> ],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> The issue seems to be that if some records have arrays with maps in
>>>>> them
>>>>>>>>>> and others are empty.
>>>>>>>>>> 
>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>> {
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [],
>>>>>>>>>>> "symbols": [],
>>>>>>>>>>> "urls": [],
>>>>>>>>>>> "hashtags": [
>>>>>>>>>>> {
>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>   "indices": [
>>>>>>>>>>>     83,
>>>>>>>>>>>     94
>>>>>>>>>>>   ]
>>>>>>>>>>> }
>>>>>>>>>>> ],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> },
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>> "hashtags": [
>>>>>>>>>>> {
>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>   "indices": []
>>>>>>>>>>> }
>>>>>>>>>>> ],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>>> where
>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>>> where
>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> +------------+
>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hao
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>> primitive
>>>>>>>>>>>> types. In your example, it appears from the error, the array
>>> field
>>>>>>>>>> 'member'
>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>> cmatta@mapr.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>> materialize
>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>>>>> query.
>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ​
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>> adityakishore@gmail.com
>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
>>>>> with
>>>>>>>> data
>>>>>>>>>>>> in
>>>>>>>>>>>>>>> it. When trying to read records with data in the array drill
>>>>> fails
>>>>>>>>>> due
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
>>>>> where
>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
>>>>> not
>>>>>>>>>>>> null.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
>>> an
>>>>>>>>>>>> example
>>>>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
>>> and
>>>>>>>>>> others
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>>>>>   "indices": [
>>>>>>>>>>>>>>>     83,
>>>>>>>>>>>>>>>     94
>>>>>>>>>>>>>>>   ]
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> <z.json>
>>>>> 
>>>>> 
>>> 
>>> 
> 


Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
It was interesting to see flatten bypass the records with an empty array.

Unfortunately some arrays are much more complex than the example here, and it is still useful to have the ability to filter out ones that are empty.

—Andries

On Jan 21, 2015, at 5:04 PM, Hao Zhu <hz...@maprtech.com> wrote:

> I figured out the differences after getting the json file from Andries.
> 
> My json file is like:
> {“entities”:{xxx},“entities”:{yyy}... }
> 
> Andries' json file is like:
> {“entities”:{xxx}}
> {“entities”:{yyy}}
> ...
> 
> So basically Andries' json file is contains multiple json files.
> 
> Fortunately, we can use the same SQL to get the same results:
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> +------------+
> |   EXPR$0   |
> +------------+
> | []         |
> | [{"text":"GoPatriots"}] |
> | [{"text":"aaa"}] |
> | []         |
> | [{"text":"bbb"}] |
> +------------+
> 5 rows selected (0.136 seconds)
> 0: jdbc:drill:> with tmp as
> . . . . . . . > (
> . . . . . . . > select flatten(t.entities.hashtags) as c from
> dfs.tmp.`z2.json` t
> . . . . . . . > )
> . . . . . . . > select tmp.c.text from tmp;
> +------------+
> |   EXPR$0   |
> +------------+
> | GoPatriots |
> | aaa        |
> | bbb        |
> +------------+
> 3 rows selected (0.115 seconds)
> 
> Thanks,
> Hao
> 
> 
> 
> 
> 
> 
> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> In my case it returns the empty records when flatten is not used.
>> 
>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
>> from `twitter.json` t limit 10;
>> +------------+
>> |  hashtags  |
>> +------------+
>> | []         |
>> | [{"text":"SportsNews","indices":[0,11]}] |
>> | []         |
>> | [{"text":"SportsNews","indices":[0,11]}] |
>> | []         |
>> | []         |
>> | []         |
>> | []         |
>> | []         |
>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>> +------------+
>> 
>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>> 
>>> Actually not due to flatten, if you directly query the file, it will only
>>> show the non-null values.
>>> 
>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>> +------------+
>>> 1 row selected (0.123 seconds)
>>> 
>>> Thanks,
>>> Hao
>>> 
>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> Very interesting, flatten seems to bypass empty records. Not sure if
>> that
>>>> is an ideal result for all use cases, but certainly usable in this case.
>>>> 
>>>> Thanks
>>>> —Andries
>>>> 
>>>> 
>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>> 
>>>>> I can also fetch non-null values for attached json file which contains
>> 6
>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>> 
>>>>> Could you share your complete "twitter.json"?
>>>>> 
>>>>> 0: jdbc:drill:> with tmp as
>>>>> . . . . . . . > (
>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>> dfs.tmp.`z.json` t
>>>>> . . . . . . . > )
>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | GoPatriots |
>>>>> | aaa        |
>>>>> | bbb        |
>>>>> +------------+
>>>>> 3 rows selected (0.122 seconds)
>>>>> 
>>>>> Thanks,
>>>>> Hao
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>>> The sample data I posted only has 1 element, but some records have
>>>> multiple elements in them.
>>>>> 
>>>>> Interestingly enough though
>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>>>>> 
>>>>> Produces
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | null       |
>>>>> | SportsNews |
>>>>> | null       |
>>>>> | SportsNews |
>>>>> | null       |
>>>>> | null       |
>>>>> | null       |
>>>>> | null       |
>>>>> | null       |
>>>>> | CARvsSEA   |
>>>>> +——————+
>>>>> 
>>>>> 
>>>>> And
>>>>> 
>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>> 
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | {"indices":[]} |
>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>> | {"indices":[]} |
>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>> | {"indices":[]} |
>>>>> | {"indices":[]} |
>>>>> | {"indices":[]} |
>>>>> | {"indices":[]} |
>>>>> | {"indices":[]} |
>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>> +——————+
>>>>> 
>>>>> Strange part is that there is no indices map in the hashtags array, so
>>>> no idea why it shows up when pointing to the first lament in an empty
>> array.
>>>>> 
>>>>> 
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>> 
>>>>>> I just noticed that the result of "hashtags" is just an array with
>>>> only 1
>>>>>> element.
>>>>>> So take your example:
>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>> {
>>>>>> "entities": {
>>>>>> "trends": [],
>>>>>> "symbols": [],
>>>>>> "urls": [],
>>>>>> "hashtags": [],
>>>>>> "user_mentions": []
>>>>>> },
>>>>>> "entities": {
>>>>>> "trends": [1,2,3],
>>>>>> "symbols": [4,5,6],
>>>>>> "urls": [7,8,9],
>>>>>> "hashtags": [
>>>>>>   {
>>>>>>     "text": "GoPatriots",
>>>>>>     "indices": []
>>>>>>   }
>>>>>> ],
>>>>>> "user_mentions": []
>>>>>> }
>>>>>> }
>>>>>> 
>>>>>> Now we can do this to achieve the results:
>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | [{"text":"GoPatriots"}] |
>>>>>> +------------+
>>>>>> 1 row selected (0.09 seconds)
>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>> dfs.tmp.`d.json` t ;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | GoPatriots |
>>>>>> +------------+
>>>>>> 1 row selected (0.108 seconds)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>> 
>>>>>>> When I run the query on a larger dataset it actually show the empty
>>>>>>> records.
>>>>>>> 
>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>> 
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | []         |
>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>> | []         |
>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | []         |
>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>> +------------+
>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> However having the output as maps is not very useful, unless i can
>>>> filter
>>>>>>> out the records with empty arrays and then drill deeper into the ones
>>>> with
>>>>>>> data in the arrays.
>>>>>>> 
>>>>>>> BTW: Hao I would have expiated your query to return both rows, one
>>>> with an
>>>>>>> empty array as above and the other with the array data.
>>>>>>> 
>>>>>>> 
>>>>>>> —Andries
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>> If we only select "hashtags", and it will return only 1 row ignoring
>>>> the
>>>>>>>> null value.
>>>>>>>> However then if we try to get "hashtags.text", it fails...which
>>>> means it
>>>>>>> is
>>>>>>>> still trying to read the NULL value.
>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>> dfs.tmp.`d.json` t ;
>>>>>>>> 
>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>> cast to
>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Error: exception while executing query: Failure while executing
>>>> query.
>>>>>>>> (state=,code=0)
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Hao
>>>>>>>> 
>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>> 
>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>> `/twitter.json` t
>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>> 
>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>> cast to
>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Error: exception while executing query: Failure while executing
>>>> query.
>>>>>>>>> (state=,code=0)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> "entities": {
>>>>>>>>> "trends": [],
>>>>>>>>> "symbols": [],
>>>>>>>>> "urls": [],
>>>>>>>>> "hashtags": [],
>>>>>>>>> "user_mentions": []
>>>>>>>>> },
>>>>>>>>> "entities": {
>>>>>>>>> "trends": [1,2,3],
>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>> "urls": [7,8,9],
>>>>>>>>> "hashtags": [
>>>>>>>>>  {
>>>>>>>>>    "text": "GoPatriots",
>>>>>>>>>    "indices": []
>>>>>>>>>  }
>>>>>>>>> ],
>>>>>>>>> "user_mentions": []
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> The issue seems to be that if some records have arrays with maps in
>>>> them
>>>>>>>>> and others are empty.
>>>>>>>>> 
>>>>>>>>> —Andries
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>> {
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [],
>>>>>>>>>> "symbols": [],
>>>>>>>>>> "urls": [],
>>>>>>>>>> "hashtags": [
>>>>>>>>>>  {
>>>>>>>>>>    "text": "GoPatriots",
>>>>>>>>>>    "indices": [
>>>>>>>>>>      83,
>>>>>>>>>>      94
>>>>>>>>>>    ]
>>>>>>>>>>  }
>>>>>>>>>> ],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> },
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>> "hashtags": [
>>>>>>>>>>  {
>>>>>>>>>>    "text": "GoPatriots",
>>>>>>>>>>    "indices": []
>>>>>>>>>>  }
>>>>>>>>>> ],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>> where
>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>> where
>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>> +------------+
>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>> +------------+
>>>>>>>>>> +------------+
>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Hao
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>> primitive
>>>>>>>>>>> types. In your example, it appears from the error, the array
>> field
>>>>>>>>> 'member'
>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>> cmatta@mapr.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>> 
>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>> materialize
>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>> 
>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>>>>>>> --UNKNOWN
>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>> 
>>>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>>>> query.
>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>> 
>>>>>>>>>>>> ​
>>>>>>>>>>>> 
>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>> adityakishore@gmail.com
>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
>>>> with
>>>>>>> data
>>>>>>>>>>> in
>>>>>>>>>>>>>> it. When trying to read records with data in the array drill
>>>> fails
>>>>>>>>> due
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
>>>> where
>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
>>>> not
>>>>>>>>>>> null.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
>> an
>>>>>>>>>>> example
>>>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
>> and
>>>>>>>>> others
>>>>>>>>>>>>> will
>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    "text": "GoPatriots",
>>>>>>>>>>>>>>    "indices": [
>>>>>>>>>>>>>>      83,
>>>>>>>>>>>>>>      94
>>>>>>>>>>>>>>    ]
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>> },
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> <z.json>
>>>> 
>>>> 
>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
I figured out the differences after getting the json file from Andries.

My json file is like:
{“entities”:{xxx},“entities”:{yyy}... }

Andries' json file is like:
{“entities”:{xxx}}
{“entities”:{yyy}}
...

So basically Andries' json file is contains multiple json files.

Fortunately, we can use the same SQL to get the same results:
0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
+------------+
|   EXPR$0   |
+------------+
| []         |
| [{"text":"GoPatriots"}] |
| [{"text":"aaa"}] |
| []         |
| [{"text":"bbb"}] |
+------------+
5 rows selected (0.136 seconds)
0: jdbc:drill:> with tmp as
. . . . . . . > (
. . . . . . . > select flatten(t.entities.hashtags) as c from
dfs.tmp.`z2.json` t
. . . . . . . > )
. . . . . . . > select tmp.c.text from tmp;
+------------+
|   EXPR$0   |
+------------+
| GoPatriots |
| aaa        |
| bbb        |
+------------+
3 rows selected (0.115 seconds)

Thanks,
Hao






On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> In my case it returns the empty records when flatten is not used.
>
> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> from `twitter.json` t limit 10;
> +------------+
> |  hashtags  |
> +------------+
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | []         |
> | []         |
> | []         |
> | []         |
> | [{"text":"CARvsSEA","indices":[36,45]}] |
> +------------+
>
> On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > Actually not due to flatten, if you directly query the file, it will only
> > show the non-null values.
> >
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > +------------+
> > 1 row selected (0.123 seconds)
> >
> > Thanks,
> > Hao
> >
> > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> Very interesting, flatten seems to bypass empty records. Not sure if
> that
> >> is an ideal result for all use cases, but certainly usable in this case.
> >>
> >> Thanks
> >> —Andries
> >>
> >>
> >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>
> >>> I can also fetch non-null values for attached json file which contains
> 6
> >> "entities", 3 of them are null, 3 of them have "text" value.
> >>>
> >>> Could you share your complete "twitter.json"?
> >>>
> >>> 0: jdbc:drill:> with tmp as
> >>> . . . . . . . > (
> >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >> dfs.tmp.`z.json` t
> >>> . . . . . . . > )
> >>> . . . . . . . > select tmp.c.text from tmp;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | GoPatriots |
> >>> | aaa        |
> >>> | bbb        |
> >>> +------------+
> >>> 3 rows selected (0.122 seconds)
> >>>
> >>> Thanks,
> >>> Hao
> >>>
> >>>
> >>>
> >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >> aengelbrecht@maprtech.com> wrote:
> >>> The sample data I posted only has 1 element, but some records have
> >> multiple elements in them.
> >>>
> >>> Interestingly enough though
> >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> >>>
> >>> Produces
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | null       |
> >>> | SportsNews |
> >>> | null       |
> >>> | SportsNews |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | CARvsSEA   |
> >>> +——————+
> >>>
> >>>
> >>> And
> >>>
> >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | {"indices":[]} |
> >>> | {"text":"SportsNews","indices":[90,99]} |
> >>> | {"indices":[]} |
> >>> | {"text":"SportsNews","indices":[90,99]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>> +——————+
> >>>
> >>> Strange part is that there is no indices map in the hashtags array, so
> >> no idea why it shows up when pointing to the first lament in an empty
> array.
> >>>
> >>>
> >>> —Andries
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>
> >>>> I just noticed that the result of "hashtags" is just an array with
> >> only 1
> >>>> element.
> >>>> So take your example:
> >>>> [root@maprdemo tmp]# cat d.json
> >>>> {
> >>>> "entities": {
> >>>>  "trends": [],
> >>>>  "symbols": [],
> >>>>  "urls": [],
> >>>>  "hashtags": [],
> >>>>  "user_mentions": []
> >>>> },
> >>>> "entities": {
> >>>>  "trends": [1,2,3],
> >>>>  "symbols": [4,5,6],
> >>>>  "urls": [7,8,9],
> >>>>  "hashtags": [
> >>>>    {
> >>>>      "text": "GoPatriots",
> >>>>      "indices": []
> >>>>    }
> >>>>  ],
> >>>>  "user_mentions": []
> >>>> }
> >>>> }
> >>>>
> >>>> Now we can do this to achieve the results:
> >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | [{"text":"GoPatriots"}] |
> >>>> +------------+
> >>>> 1 row selected (0.09 seconds)
> >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >> dfs.tmp.`d.json` t ;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | GoPatriots |
> >>>> +------------+
> >>>> 1 row selected (0.108 seconds)
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>> aengelbrecht@maprtech.com> wrote:
> >>>>
> >>>>> When I run the query on a larger dataset it actually show the empty
> >>>>> records.
> >>>>>
> >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | []         |
> >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>> | []         |
> >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>> +------------+
> >>>>> 10 rows selected (2.899 seconds)
> >>>>>
> >>>>>
> >>>>>
> >>>>> However having the output as maps is not very useful, unless i can
> >> filter
> >>>>> out the records with empty arrays and then drill deeper into the ones
> >> with
> >>>>> data in the arrays.
> >>>>>
> >>>>> BTW: Hao I would have expiated your query to return both rows, one
> >> with an
> >>>>> empty array as above and the other with the array data.
> >>>>>
> >>>>>
> >>>>> —Andries
> >>>>>
> >>>>>
> >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>
> >>>>>> I am not sure if below is expected behavior.
> >>>>>> If we only select "hashtags", and it will return only 1 row ignoring
> >> the
> >>>>>> null value.
> >>>>>> However then if we try to get "hashtags.text", it fails...which
> >> means it
> >>>>> is
> >>>>>> still trying to read the NULL value.
> >>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>
> >>>>>>
> >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | [{"text":"GoPatriots"}] |
> >>>>>> +------------+
> >>>>>> 1 row selected (0.109 seconds)
> >>>>>>
> >>>>>>
> >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >> dfs.tmp.`d.json` t ;
> >>>>>>
> >>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >> cast to
> >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>
> >>>>>>
> >>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>> (state=,code=0)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>
> >>>>>>> Now try on hashtags with the following:
> >>>>>>>
> >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >> `/twitter.json` t
> >>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>
> >>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >> cast to
> >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>
> >>>>>>>
> >>>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>>> (state=,code=0)
> >>>>>>>
> >>>>>>>
> >>>>>>> {
> >>>>>>> "entities": {
> >>>>>>> "trends": [],
> >>>>>>> "symbols": [],
> >>>>>>> "urls": [],
> >>>>>>> "hashtags": [],
> >>>>>>> "user_mentions": []
> >>>>>>> },
> >>>>>>> "entities": {
> >>>>>>> "trends": [1,2,3],
> >>>>>>> "symbols": [4,5,6],
> >>>>>>> "urls": [7,8,9],
> >>>>>>> "hashtags": [
> >>>>>>>   {
> >>>>>>>     "text": "GoPatriots",
> >>>>>>>     "indices": []
> >>>>>>>   }
> >>>>>>> ],
> >>>>>>> "user_mentions": []
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The issue seems to be that if some records have arrays with maps in
> >> them
> >>>>>>> and others are empty.
> >>>>>>>
> >>>>>>> —Andries
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> Seems it works for below json file:
> >>>>>>>> {
> >>>>>>>> "entities": {
> >>>>>>>> "trends": [],
> >>>>>>>> "symbols": [],
> >>>>>>>> "urls": [],
> >>>>>>>> "hashtags": [
> >>>>>>>>   {
> >>>>>>>>     "text": "GoPatriots",
> >>>>>>>>     "indices": [
> >>>>>>>>       83,
> >>>>>>>>       94
> >>>>>>>>     ]
> >>>>>>>>   }
> >>>>>>>> ],
> >>>>>>>> "user_mentions": []
> >>>>>>>> },
> >>>>>>>> "entities": {
> >>>>>>>> "trends": [1,2,3],
> >>>>>>>> "symbols": [4,5,6],
> >>>>>>>> "urls": [7,8,9],
> >>>>>>>> "hashtags": [
> >>>>>>>>   {
> >>>>>>>>     "text": "GoPatriots",
> >>>>>>>>     "indices": []
> >>>>>>>>   }
> >>>>>>>> ],
> >>>>>>>> "user_mentions": []
> >>>>>>>> }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> >> where
> >>>>>>>> t.entities.urls is not null;
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> | [7,8,9]    |
> >>>>>>>> +------------+
> >>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> >> where
> >>>>>>>> t.entities.urls is null;
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> +------------+
> >>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Hao
> >>>>>>>>
> >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I believe that this works if the array contains homogeneous
> >> primitive
> >>>>>>>>> types. In your example, it appears from the error, the array
> field
> >>>>>>> 'member'
> >>>>>>>>> contained maps for at least one record.
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >> cmatta@mapr.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>
> >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> materialize
> >>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>
> >>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> >>>>>>> --UNKNOWN
> >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>
> >>>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>> (state=,code=0)
> >>>>>>>>>>
> >>>>>>>>>> ​
> >>>>>>>>>>
> >>>>>>>>>> Chris Matta
> >>>>>>>>>> cmatta@mapr.com
> >>>>>>>>>> 215-701-3146
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> adityakishore@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
> >> with
> >>>>> data
> >>>>>>>>> in
> >>>>>>>>>>>> it. When trying to read records with data in the array drill
> >> fails
> >>>>>>> due
> >>>>>>>>>>> to
> >>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
> >> where
> >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
> >> not
> >>>>>>>>> null.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
> an
> >>>>>>>>> example
> >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> and
> >>>>>>> others
> >>>>>>>>>>> will
> >>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> "entities": {
> >>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>   {
> >>>>>>>>>>>>     "text": "GoPatriots",
> >>>>>>>>>>>>     "indices": [
> >>>>>>>>>>>>       83,
> >>>>>>>>>>>>       94
> >>>>>>>>>>>>     ]
> >>>>>>>>>>>>   }
> >>>>>>>>>>>> ],
> >>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>> },
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> —Andries
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> <z.json>
> >>
> >>
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
In my case it returns the empty records when flatten is not used.

0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags from `twitter.json` t limit 10;
+------------+
|  hashtags  |
+------------+
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| []         |
| []         |
| []         |
| []         |
| [{"text":"CARvsSEA","indices":[36,45]}] |
+------------+

On Jan 21, 2015, at 4:26 PM, Hao Zhu <hz...@maprtech.com> wrote:

> Actually not due to flatten, if you directly query the file, it will only
> show the non-null values.
> 
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> +------------+
> |   EXPR$0   |
> +------------+
> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> +------------+
> 1 row selected (0.123 seconds)
> 
> Thanks,
> Hao
> 
> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Very interesting, flatten seems to bypass empty records. Not sure if that
>> is an ideal result for all use cases, but certainly usable in this case.
>> 
>> Thanks
>> —Andries
>> 
>> 
>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>> 
>>> I can also fetch non-null values for attached json file which contains 6
>> "entities", 3 of them are null, 3 of them have "text" value.
>>> 
>>> Could you share your complete "twitter.json"?
>>> 
>>> 0: jdbc:drill:> with tmp as
>>> . . . . . . . > (
>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>> dfs.tmp.`z.json` t
>>> . . . . . . . > )
>>> . . . . . . . > select tmp.c.text from tmp;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | GoPatriots |
>>> | aaa        |
>>> | bbb        |
>>> +------------+
>>> 3 rows selected (0.122 seconds)
>>> 
>>> Thanks,
>>> Hao
>>> 
>>> 
>>> 
>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>> aengelbrecht@maprtech.com> wrote:
>>> The sample data I posted only has 1 element, but some records have
>> multiple elements in them.
>>> 
>>> Interestingly enough though
>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>>> 
>>> Produces
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | null       |
>>> | SportsNews |
>>> | null       |
>>> | SportsNews |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | CARvsSEA   |
>>> +——————+
>>> 
>>> 
>>> And
>>> 
>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>> 
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | {"indices":[]} |
>>> | {"text":"SportsNews","indices":[90,99]} |
>>> | {"indices":[]} |
>>> | {"text":"SportsNews","indices":[90,99]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>> +——————+
>>> 
>>> Strange part is that there is no indices map in the hashtags array, so
>> no idea why it shows up when pointing to the first lament in an empty array.
>>> 
>>> 
>>> —Andries
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>> 
>>>> I just noticed that the result of "hashtags" is just an array with
>> only 1
>>>> element.
>>>> So take your example:
>>>> [root@maprdemo tmp]# cat d.json
>>>> {
>>>> "entities": {
>>>>  "trends": [],
>>>>  "symbols": [],
>>>>  "urls": [],
>>>>  "hashtags": [],
>>>>  "user_mentions": []
>>>> },
>>>> "entities": {
>>>>  "trends": [1,2,3],
>>>>  "symbols": [4,5,6],
>>>>  "urls": [7,8,9],
>>>>  "hashtags": [
>>>>    {
>>>>      "text": "GoPatriots",
>>>>      "indices": []
>>>>    }
>>>>  ],
>>>>  "user_mentions": []
>>>> }
>>>> }
>>>> 
>>>> Now we can do this to achieve the results:
>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | [{"text":"GoPatriots"}] |
>>>> +------------+
>>>> 1 row selected (0.09 seconds)
>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>> dfs.tmp.`d.json` t ;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | GoPatriots |
>>>> +------------+
>>>> 1 row selected (0.108 seconds)
>>>> 
>>>> Thanks,
>>>> Hao
>>>> 
>>>> 
>>>> 
>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> When I run the query on a larger dataset it actually show the empty
>>>>> records.
>>>>> 
>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>> 
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | []         |
>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>> | []         |
>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>> +------------+
>>>>> 10 rows selected (2.899 seconds)
>>>>> 
>>>>> 
>>>>> 
>>>>> However having the output as maps is not very useful, unless i can
>> filter
>>>>> out the records with empty arrays and then drill deeper into the ones
>> with
>>>>> data in the arrays.
>>>>> 
>>>>> BTW: Hao I would have expiated your query to return both rows, one
>> with an
>>>>> empty array as above and the other with the array data.
>>>>> 
>>>>> 
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>> 
>>>>>> I am not sure if below is expected behavior.
>>>>>> If we only select "hashtags", and it will return only 1 row ignoring
>> the
>>>>>> null value.
>>>>>> However then if we try to get "hashtags.text", it fails...which
>> means it
>>>>> is
>>>>>> still trying to read the NULL value.
>>>>>> I am thinking it may confuse the SQL developers.
>>>>>> 
>>>>>> 
>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | [{"text":"GoPatriots"}] |
>>>>>> +------------+
>>>>>> 1 row selected (0.109 seconds)
>>>>>> 
>>>>>> 
>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>> dfs.tmp.`d.json` t ;
>>>>>> 
>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>> cast to
>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>> 
>>>>>> 
>>>>>> Error: exception while executing query: Failure while executing
>> query.
>>>>>> (state=,code=0)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>> 
>>>>>>> Now try on hashtags with the following:
>>>>>>> 
>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>> `/twitter.json` t
>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>> 
>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>> cast to
>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>> 
>>>>>>> 
>>>>>>> Error: exception while executing query: Failure while executing
>> query.
>>>>>>> (state=,code=0)
>>>>>>> 
>>>>>>> 
>>>>>>> {
>>>>>>> "entities": {
>>>>>>> "trends": [],
>>>>>>> "symbols": [],
>>>>>>> "urls": [],
>>>>>>> "hashtags": [],
>>>>>>> "user_mentions": []
>>>>>>> },
>>>>>>> "entities": {
>>>>>>> "trends": [1,2,3],
>>>>>>> "symbols": [4,5,6],
>>>>>>> "urls": [7,8,9],
>>>>>>> "hashtags": [
>>>>>>>   {
>>>>>>>     "text": "GoPatriots",
>>>>>>>     "indices": []
>>>>>>>   }
>>>>>>> ],
>>>>>>> "user_mentions": []
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> The issue seems to be that if some records have arrays with maps in
>> them
>>>>>>> and others are empty.
>>>>>>> 
>>>>>>> —Andries
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> Seems it works for below json file:
>>>>>>>> {
>>>>>>>> "entities": {
>>>>>>>> "trends": [],
>>>>>>>> "symbols": [],
>>>>>>>> "urls": [],
>>>>>>>> "hashtags": [
>>>>>>>>   {
>>>>>>>>     "text": "GoPatriots",
>>>>>>>>     "indices": [
>>>>>>>>       83,
>>>>>>>>       94
>>>>>>>>     ]
>>>>>>>>   }
>>>>>>>> ],
>>>>>>>> "user_mentions": []
>>>>>>>> },
>>>>>>>> "entities": {
>>>>>>>> "trends": [1,2,3],
>>>>>>>> "symbols": [4,5,6],
>>>>>>>> "urls": [7,8,9],
>>>>>>>> "hashtags": [
>>>>>>>>   {
>>>>>>>>     "text": "GoPatriots",
>>>>>>>>     "indices": []
>>>>>>>>   }
>>>>>>>> ],
>>>>>>>> "user_mentions": []
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>> where
>>>>>>>> t.entities.urls is not null;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | [7,8,9]    |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>> where
>>>>>>>> t.entities.urls is null;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> +------------+
>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Hao
>>>>>>>> 
>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I believe that this works if the array contains homogeneous
>> primitive
>>>>>>>>> types. In your example, it appears from the error, the array field
>>>>>>> 'member'
>>>>>>>>> contained maps for at least one record.
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>> cmatta@mapr.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>> Query failed: Query stopped., Failure while trying to materialize
>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>> 
>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>>>>> --UNKNOWN
>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>> bullseye-3:31010 ]
>>>>>>>>>> 
>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>> (state=,code=0)
>>>>>>>>>> 
>>>>>>>>>> ​
>>>>>>>>>> 
>>>>>>>>>> Chris Matta
>>>>>>>>>> cmatta@mapr.com
>>>>>>>>>> 215-701-3146
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <adityakishore@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
>> with
>>>>> data
>>>>>>>>> in
>>>>>>>>>>>> it. When trying to read records with data in the array drill
>> fails
>>>>>>> due
>>>>>>>>>>> to
>>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
>> where
>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
>> not
>>>>>>>>> null.
>>>>>>>>>>>> 
>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as an
>>>>>>>>> example
>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
>>>>>>> others
>>>>>>>>>>> will
>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>> 
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>   {
>>>>>>>>>>>>     "text": "GoPatriots",
>>>>>>>>>>>>     "indices": [
>>>>>>>>>>>>       83,
>>>>>>>>>>>>       94
>>>>>>>>>>>>     ]
>>>>>>>>>>>>   }
>>>>>>>>>>>> ],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> },
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> <z.json>
>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
Actually not due to flatten, if you directly query the file, it will only
show the non-null values.

0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
+------------+
|   EXPR$0   |
+------------+
| [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
+------------+
1 row selected (0.123 seconds)

Thanks,
Hao

On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Very interesting, flatten seems to bypass empty records. Not sure if that
> is an ideal result for all use cases, but certainly usable in this case.
>
> Thanks
> —Andries
>
>
> On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > I can also fetch non-null values for attached json file which contains 6
> "entities", 3 of them are null, 3 of them have "text" value.
> >
> > Could you share your complete "twitter.json"?
> >
> > 0: jdbc:drill:> with tmp as
> > . . . . . . . > (
> > . . . . . . . > select flatten(t.entities.hashtags) as c from
> dfs.tmp.`z.json` t
> > . . . . . . . > )
> > . . . . . . . > select tmp.c.text from tmp;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | GoPatriots |
> > | aaa        |
> > | bbb        |
> > +------------+
> > 3 rows selected (0.122 seconds)
> >
> > Thanks,
> > Hao
> >
> >
> >
> > On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> > The sample data I posted only has 1 element, but some records have
> multiple elements in them.
> >
> > Interestingly enough though
> > select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> >
> > Produces
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | null       |
> > | SportsNews |
> > | null       |
> > | SportsNews |
> > | null       |
> > | null       |
> > | null       |
> > | null       |
> > | null       |
> > | CARvsSEA   |
> > +——————+
> >
> >
> > And
> >
> > select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | {"indices":[]} |
> > | {"text":"SportsNews","indices":[90,99]} |
> > | {"indices":[]} |
> > | {"text":"SportsNews","indices":[90,99]} |
> > | {"indices":[]} |
> > | {"indices":[]} |
> > | {"indices":[]} |
> > | {"indices":[]} |
> > | {"indices":[]} |
> > | {"text":"CARvsSEA","indices":[90,99]} |
> > +——————+
> >
> > Strange part is that there is no indices map in the hashtags array, so
> no idea why it shows up when pointing to the first lament in an empty array.
> >
> >
> > —Andries
> >
> >
> >
> >
> >
> >
> > On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >
> > > I just noticed that the result of "hashtags" is just an array with
> only 1
> > > element.
> > > So take your example:
> > > [root@maprdemo tmp]# cat d.json
> > > {
> > > "entities": {
> > >   "trends": [],
> > >   "symbols": [],
> > >   "urls": [],
> > >   "hashtags": [],
> > >   "user_mentions": []
> > > },
> > > "entities": {
> > >   "trends": [1,2,3],
> > >   "symbols": [4,5,6],
> > >   "urls": [7,8,9],
> > >   "hashtags": [
> > >     {
> > >       "text": "GoPatriots",
> > >       "indices": []
> > >     }
> > >   ],
> > >   "user_mentions": []
> > > }
> > > }
> > >
> > > Now we can do this to achieve the results:
> > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | [{"text":"GoPatriots"}] |
> > > +------------+
> > > 1 row selected (0.09 seconds)
> > > 0: jdbc:drill:> select t.entities.hashtags[0].text from
> dfs.tmp.`d.json` t ;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | GoPatriots |
> > > +------------+
> > > 1 row selected (0.108 seconds)
> > >
> > > Thanks,
> > > Hao
> > >
> > >
> > >
> > > On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > >> When I run the query on a larger dataset it actually show the empty
> > >> records.
> > >>
> > >> select t.entities.hashtags from `twitter.json` t limit 10;
> > >>
> > >> +------------+
> > >> |   EXPR$0   |
> > >> +------------+
> > >> | []         |
> > >> | [{"text":"SportsNews","indices":[0,11]}] |
> > >> | []         |
> > >> | [{"text":"SportsNews","indices":[0,11]}] |
> > >> | []         |
> > >> | []         |
> > >> | []         |
> > >> | []         |
> > >> | []         |
> > >> | [{"text":"CARvsSEA","indices":[36,45]}] |
> > >> +------------+
> > >> 10 rows selected (2.899 seconds)
> > >>
> > >>
> > >>
> > >> However having the output as maps is not very useful, unless i can
> filter
> > >> out the records with empty arrays and then drill deeper into the ones
> with
> > >> data in the arrays.
> > >>
> > >> BTW: Hao I would have expiated your query to return both rows, one
> with an
> > >> empty array as above and the other with the array data.
> > >>
> > >>
> > >> —Andries
> > >>
> > >>
> > >> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>
> > >>> I am not sure if below is expected behavior.
> > >>> If we only select "hashtags", and it will return only 1 row ignoring
> the
> > >>> null value.
> > >>> However then if we try to get "hashtags.text", it fails...which
> means it
> > >> is
> > >>> still trying to read the NULL value.
> > >>> I am thinking it may confuse the SQL developers.
> > >>>
> > >>>
> > >>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > >>> +------------+
> > >>> |   EXPR$0   |
> > >>> +------------+
> > >>> | [{"text":"GoPatriots"}] |
> > >>> +------------+
> > >>> 1 row selected (0.109 seconds)
> > >>>
> > >>>
> > >>> 0: jdbc:drill:> select t.entities.hashtags.text from
> dfs.tmp.`d.json` t ;
> > >>>
> > >>> Query failed: Query failed: Failure while running fragment.,
> > >>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> cast to
> > >>> org.apache.drill.exec.vector.complex.MapVector [
> > >>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > >>>
> > >>>
> > >>> Error: exception while executing query: Failure while executing
> query.
> > >>> (state=,code=0)
> > >>>
> > >>> Thanks,
> > >>> Hao
> > >>>
> > >>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > >>> aengelbrecht@maprtech.com> wrote:
> > >>>
> > >>>> Now try on hashtags with the following:
> > >>>>
> > >>>> drilldemo:5181> select t.entities.hashtags.`text` from
> `/twitter.json` t
> > >>>> where t.entities.hashtags is not null limit 10;
> > >>>>
> > >>>> Query failed: Query failed: Failure while running fragment.,
> > >>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> cast to
> > >>>> org.apache.drill.exec.vector.complex.MapVector [
> > >>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> > >>>>
> > >>>>
> > >>>> Error: exception while executing query: Failure while executing
> query.
> > >>>> (state=,code=0)
> > >>>>
> > >>>>
> > >>>> {
> > >>>> "entities": {
> > >>>>  "trends": [],
> > >>>>  "symbols": [],
> > >>>>  "urls": [],
> > >>>>  "hashtags": [],
> > >>>>  "user_mentions": []
> > >>>> },
> > >>>> "entities": {
> > >>>>  "trends": [1,2,3],
> > >>>>  "symbols": [4,5,6],
> > >>>>  "urls": [7,8,9],
> > >>>>  "hashtags": [
> > >>>>    {
> > >>>>      "text": "GoPatriots",
> > >>>>      "indices": []
> > >>>>    }
> > >>>>  ],
> > >>>>  "user_mentions": []
> > >>>> }
> > >>>> }
> > >>>>
> > >>>> The issue seems to be that if some records have arrays with maps in
> them
> > >>>> and others are empty.
> > >>>>
> > >>>> —Andries
> > >>>>
> > >>>>
> > >>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> > >>>>
> > >>>>> Seems it works for below json file:
> > >>>>> {
> > >>>>> "entities": {
> > >>>>>  "trends": [],
> > >>>>>  "symbols": [],
> > >>>>>  "urls": [],
> > >>>>>  "hashtags": [
> > >>>>>    {
> > >>>>>      "text": "GoPatriots",
> > >>>>>      "indices": [
> > >>>>>        83,
> > >>>>>        94
> > >>>>>      ]
> > >>>>>    }
> > >>>>>  ],
> > >>>>>  "user_mentions": []
> > >>>>> },
> > >>>>> "entities": {
> > >>>>>  "trends": [1,2,3],
> > >>>>>  "symbols": [4,5,6],
> > >>>>>  "urls": [7,8,9],
> > >>>>>  "hashtags": [
> > >>>>>    {
> > >>>>>      "text": "GoPatriots",
> > >>>>>      "indices": []
> > >>>>>    }
> > >>>>>  ],
> > >>>>>  "user_mentions": []
> > >>>>> }
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> where
> > >>>>> t.entities.urls is not null;
> > >>>>> +------------+
> > >>>>> |   EXPR$0   |
> > >>>>> +------------+
> > >>>>> | [7,8,9]    |
> > >>>>> +------------+
> > >>>>> 1 row selected (0.139 seconds)
> > >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> where
> > >>>>> t.entities.urls is null;
> > >>>>> +------------+
> > >>>>> |   EXPR$0   |
> > >>>>> +------------+
> > >>>>> +------------+
> > >>>>> No rows selected (0.158 seconds)
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Hao
> > >>>>>
> > >>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
> > >> wrote:
> > >>>>>
> > >>>>>> I believe that this works if the array contains homogeneous
> primitive
> > >>>>>> types. In your example, it appears from the error, the array field
> > >>>> 'member'
> > >>>>>> contained maps for at least one record.
> > >>>>>>
> > >>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> cmatta@mapr.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Trying that locally did not work for me (drill 0.7.0):
> > >>>>>>>
> > >>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> > >>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> > >>>>>>> Query failed: Query stopped., Failure while trying to materialize
> > >>>>>> incoming schema.  Errors:
> > >>>>>>>
> > >>>>>>> Error in expression at index -1.  Error: Missing function
> > >>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> > >>>> --UNKNOWN
> > >>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> > >>>> bullseye-3:31010 ]
> > >>>>>>>
> > >>>>>>> Error: exception while executing query: Failure while executing
> > >> query.
> > >>>>>> (state=,code=0)
> > >>>>>>>
> > >>>>>>> ​
> > >>>>>>>
> > >>>>>>> Chris Matta
> > >>>>>>> cmatta@mapr.com
> > >>>>>>> 215-701-3146
> > >>>>>>>
> > >>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <adityakishore@gmail.com
> >
> > >>>> wrote:
> > >>>>>>>
> > >>>>>>>> repeated_count('entities.urls') > 0
> > >>>>>>>>
> > >>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> > >>>>>>>> aengelbrecht@maprtech.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> How do you filter out records with an empty array in drill?
> > >>>>>>>>> i.e some records have "url":[]  and some will have an array
> with
> > >> data
> > >>>>>> in
> > >>>>>>>>> it. When trying to read records with data in the array drill
> fails
> > >>>> due
> > >>>>>>>> to
> > >>>>>>>>> records missing any data in the array. Trying a filter with/*
> where
> > >>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
> not
> > >>>>>> null.
> > >>>>>>>>>
> > >>>>>>>>> Note some of the arrays contains maps, using twitter data as an
> > >>>>>> example
> > >>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
> > >>>> others
> > >>>>>>>> will
> > >>>>>>>>> look similar to what is listed below.
> > >>>>>>>>>
> > >>>>>>>>> "entities": {
> > >>>>>>>>>  "trends": [],
> > >>>>>>>>>  "symbols": [],
> > >>>>>>>>>  "urls": [],
> > >>>>>>>>>  "hashtags": [
> > >>>>>>>>>    {
> > >>>>>>>>>      "text": "GoPatriots",
> > >>>>>>>>>      "indices": [
> > >>>>>>>>>        83,
> > >>>>>>>>>        94
> > >>>>>>>>>      ]
> > >>>>>>>>>    }
> > >>>>>>>>>  ],
> > >>>>>>>>>  "user_mentions": []
> > >>>>>>>>> },
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thanks
> > >>>>>>>>> —Andries
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
> > <z.json>
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Very interesting, flatten seems to bypass empty records. Not sure if that is an ideal result for all use cases, but certainly usable in this case.

Thanks
—Andries


On Jan 21, 2015, at 3:45 PM, Hao Zhu <hz...@maprtech.com> wrote:

> I can also fetch non-null values for attached json file which contains 6 "entities", 3 of them are null, 3 of them have "text" value.
> 
> Could you share your complete "twitter.json"?
> 
> 0: jdbc:drill:> with tmp as
> . . . . . . . > (
> . . . . . . . > select flatten(t.entities.hashtags) as c from dfs.tmp.`z.json` t
> . . . . . . . > )
> . . . . . . . > select tmp.c.text from tmp;
> +------------+
> |   EXPR$0   |
> +------------+
> | GoPatriots |
> | aaa        |
> | bbb        |
> +------------+
> 3 rows selected (0.122 seconds)
> 
> Thanks,
> Hao
> 
> 
> 
> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:
> The sample data I posted only has 1 element, but some records have multiple elements in them.
> 
> Interestingly enough though
> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> 
> Produces
> +------------+
> |   EXPR$0   |
> +------------+
> | null       |
> | SportsNews |
> | null       |
> | SportsNews |
> | null       |
> | null       |
> | null       |
> | null       |
> | null       |
> | CARvsSEA   |
> +——————+
> 
> 
> And
> 
> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> 
> +------------+
> |   EXPR$0   |
> +------------+
> | {"indices":[]} |
> | {"text":"SportsNews","indices":[90,99]} |
> | {"indices":[]} |
> | {"text":"SportsNews","indices":[90,99]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"text":"CARvsSEA","indices":[90,99]} |
> +——————+
> 
> Strange part is that there is no indices map in the hashtags array, so no idea why it shows up when pointing to the first lament in an empty array.
> 
> 
> —Andries
> 
> 
> 
> 
> 
> 
> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
> 
> > I just noticed that the result of "hashtags" is just an array with only 1
> > element.
> > So take your example:
> > [root@maprdemo tmp]# cat d.json
> > {
> > "entities": {
> >   "trends": [],
> >   "symbols": [],
> >   "urls": [],
> >   "hashtags": [],
> >   "user_mentions": []
> > },
> > "entities": {
> >   "trends": [1,2,3],
> >   "symbols": [4,5,6],
> >   "urls": [7,8,9],
> >   "hashtags": [
> >     {
> >       "text": "GoPatriots",
> >       "indices": []
> >     }
> >   ],
> >   "user_mentions": []
> > }
> > }
> >
> > Now we can do this to achieve the results:
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [{"text":"GoPatriots"}] |
> > +------------+
> > 1 row selected (0.09 seconds)
> > 0: jdbc:drill:> select t.entities.hashtags[0].text from dfs.tmp.`d.json` t ;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | GoPatriots |
> > +------------+
> > 1 row selected (0.108 seconds)
> >
> > Thanks,
> > Hao
> >
> >
> >
> > On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> When I run the query on a larger dataset it actually show the empty
> >> records.
> >>
> >> select t.entities.hashtags from `twitter.json` t limit 10;
> >>
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | []         |
> >> | [{"text":"SportsNews","indices":[0,11]}] |
> >> | []         |
> >> | [{"text":"SportsNews","indices":[0,11]}] |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >> +------------+
> >> 10 rows selected (2.899 seconds)
> >>
> >>
> >>
> >> However having the output as maps is not very useful, unless i can filter
> >> out the records with empty arrays and then drill deeper into the ones with
> >> data in the arrays.
> >>
> >> BTW: Hao I would have expiated your query to return both rows, one with an
> >> empty array as above and the other with the array data.
> >>
> >>
> >> —Andries
> >>
> >>
> >> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>
> >>> I am not sure if below is expected behavior.
> >>> If we only select "hashtags", and it will return only 1 row ignoring the
> >>> null value.
> >>> However then if we try to get "hashtags.text", it fails...which means it
> >> is
> >>> still trying to read the NULL value.
> >>> I am thinking it may confuse the SQL developers.
> >>>
> >>>
> >>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | [{"text":"GoPatriots"}] |
> >>> +------------+
> >>> 1 row selected (0.109 seconds)
> >>>
> >>>
> >>> 0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json` t ;
> >>>
> >>> Query failed: Query failed: Failure while running fragment.,
> >>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> >>> org.apache.drill.exec.vector.complex.MapVector [
> >>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>
> >>>
> >>> Error: exception while executing query: Failure while executing query.
> >>> (state=,code=0)
> >>>
> >>> Thanks,
> >>> Hao
> >>>
> >>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>> aengelbrecht@maprtech.com> wrote:
> >>>
> >>>> Now try on hashtags with the following:
> >>>>
> >>>> drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t
> >>>> where t.entities.hashtags is not null limit 10;
> >>>>
> >>>> Query failed: Query failed: Failure while running fragment.,
> >>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> >>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>
> >>>>
> >>>> Error: exception while executing query: Failure while executing query.
> >>>> (state=,code=0)
> >>>>
> >>>>
> >>>> {
> >>>> "entities": {
> >>>>  "trends": [],
> >>>>  "symbols": [],
> >>>>  "urls": [],
> >>>>  "hashtags": [],
> >>>>  "user_mentions": []
> >>>> },
> >>>> "entities": {
> >>>>  "trends": [1,2,3],
> >>>>  "symbols": [4,5,6],
> >>>>  "urls": [7,8,9],
> >>>>  "hashtags": [
> >>>>    {
> >>>>      "text": "GoPatriots",
> >>>>      "indices": []
> >>>>    }
> >>>>  ],
> >>>>  "user_mentions": []
> >>>> }
> >>>> }
> >>>>
> >>>> The issue seems to be that if some records have arrays with maps in them
> >>>> and others are empty.
> >>>>
> >>>> —Andries
> >>>>
> >>>>
> >>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>
> >>>>> Seems it works for below json file:
> >>>>> {
> >>>>> "entities": {
> >>>>>  "trends": [],
> >>>>>  "symbols": [],
> >>>>>  "urls": [],
> >>>>>  "hashtags": [
> >>>>>    {
> >>>>>      "text": "GoPatriots",
> >>>>>      "indices": [
> >>>>>        83,
> >>>>>        94
> >>>>>      ]
> >>>>>    }
> >>>>>  ],
> >>>>>  "user_mentions": []
> >>>>> },
> >>>>> "entities": {
> >>>>>  "trends": [1,2,3],
> >>>>>  "symbols": [4,5,6],
> >>>>>  "urls": [7,8,9],
> >>>>>  "hashtags": [
> >>>>>    {
> >>>>>      "text": "GoPatriots",
> >>>>>      "indices": []
> >>>>>    }
> >>>>>  ],
> >>>>>  "user_mentions": []
> >>>>> }
> >>>>> }
> >>>>>
> >>>>>
> >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> >>>>> t.entities.urls is not null;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | [7,8,9]    |
> >>>>> +------------+
> >>>>> 1 row selected (0.139 seconds)
> >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> >>>>> t.entities.urls is null;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> +------------+
> >>>>> No rows selected (0.158 seconds)
> >>>>>
> >>>>> Thanks,
> >>>>> Hao
> >>>>>
> >>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> I believe that this works if the array contains homogeneous primitive
> >>>>>> types. In your example, it appears from the error, the array field
> >>>> 'member'
> >>>>>> contained maps for at least one record.
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>
> >>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>> Query failed: Query stopped., Failure while trying to materialize
> >>>>>> incoming schema.  Errors:
> >>>>>>>
> >>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> >>>> --UNKNOWN
> >>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>> bullseye-3:31010 ]
> >>>>>>>
> >>>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>> (state=,code=0)
> >>>>>>>
> >>>>>>> ​
> >>>>>>>
> >>>>>>> Chris Matta
> >>>>>>> cmatta@mapr.com
> >>>>>>> 215-701-3146
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>
> >>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>> i.e some records have "url":[]  and some will have an array with
> >> data
> >>>>>> in
> >>>>>>>>> it. When trying to read records with data in the array drill fails
> >>>> due
> >>>>>>>> to
> >>>>>>>>> records missing any data in the array. Trying a filter with/* where
> >>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is not
> >>>>>> null.
> >>>>>>>>>
> >>>>>>>>> Note some of the arrays contains maps, using twitter data as an
> >>>>>> example
> >>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
> >>>> others
> >>>>>>>> will
> >>>>>>>>> look similar to what is listed below.
> >>>>>>>>>
> >>>>>>>>> "entities": {
> >>>>>>>>>  "trends": [],
> >>>>>>>>>  "symbols": [],
> >>>>>>>>>  "urls": [],
> >>>>>>>>>  "hashtags": [
> >>>>>>>>>    {
> >>>>>>>>>      "text": "GoPatriots",
> >>>>>>>>>      "indices": [
> >>>>>>>>>        83,
> >>>>>>>>>        94
> >>>>>>>>>      ]
> >>>>>>>>>    }
> >>>>>>>>>  ],
> >>>>>>>>>  "user_mentions": []
> >>>>>>>>> },
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> —Andries
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
> 
> 
> <z.json>


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
I can also fetch non-null values for attached json file which contains 6
"entities", 3 of them are null, 3 of them have "text" value.

Could you share your complete "twitter.json"?

0: jdbc:drill:> with tmp as
. . . . . . . > (
. . . . . . . > select flatten(t.entities.hashtags) as c from
dfs.tmp.`z.json` t
. . . . . . . > )
. . . . . . . > select tmp.c.text from tmp;
+------------+
|   EXPR$0   |
+------------+
| GoPatriots |
| aaa        |
| bbb        |
+------------+
3 rows selected (0.122 seconds)

Thanks,
Hao



On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> The sample data I posted only has 1 element, but some records have
> multiple elements in them.
>
> Interestingly enough though
> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>
> Produces
> +------------+
> |   EXPR$0   |
> +------------+
> | null       |
> | SportsNews |
> | null       |
> | SportsNews |
> | null       |
> | null       |
> | null       |
> | null       |
> | null       |
> | CARvsSEA   |
> +——————+
>
>
> And
>
> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>
> +------------+
> |   EXPR$0   |
> +------------+
> | {"indices":[]} |
> | {"text":"SportsNews","indices":[90,99]} |
> | {"indices":[]} |
> | {"text":"SportsNews","indices":[90,99]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"indices":[]} |
> | {"text":"CARvsSEA","indices":[90,99]} |
> +——————+
>
> Strange part is that there is no indices map in the hashtags array, so no
> idea why it shows up when pointing to the first lament in an empty array.
>
>
> —Andries
>
>
>
>
>
>
> On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > I just noticed that the result of "hashtags" is just an array with only 1
> > element.
> > So take your example:
> > [root@maprdemo tmp]# cat d.json
> > {
> > "entities": {
> >   "trends": [],
> >   "symbols": [],
> >   "urls": [],
> >   "hashtags": [],
> >   "user_mentions": []
> > },
> > "entities": {
> >   "trends": [1,2,3],
> >   "symbols": [4,5,6],
> >   "urls": [7,8,9],
> >   "hashtags": [
> >     {
> >       "text": "GoPatriots",
> >       "indices": []
> >     }
> >   ],
> >   "user_mentions": []
> > }
> > }
> >
> > Now we can do this to achieve the results:
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [{"text":"GoPatriots"}] |
> > +------------+
> > 1 row selected (0.09 seconds)
> > 0: jdbc:drill:> select t.entities.hashtags[0].text from dfs.tmp.`d.json`
> t ;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | GoPatriots |
> > +------------+
> > 1 row selected (0.108 seconds)
> >
> > Thanks,
> > Hao
> >
> >
> >
> > On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> When I run the query on a larger dataset it actually show the empty
> >> records.
> >>
> >> select t.entities.hashtags from `twitter.json` t limit 10;
> >>
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | []         |
> >> | [{"text":"SportsNews","indices":[0,11]}] |
> >> | []         |
> >> | [{"text":"SportsNews","indices":[0,11]}] |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | []         |
> >> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >> +------------+
> >> 10 rows selected (2.899 seconds)
> >>
> >>
> >>
> >> However having the output as maps is not very useful, unless i can
> filter
> >> out the records with empty arrays and then drill deeper into the ones
> with
> >> data in the arrays.
> >>
> >> BTW: Hao I would have expiated your query to return both rows, one with
> an
> >> empty array as above and the other with the array data.
> >>
> >>
> >> —Andries
> >>
> >>
> >> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>
> >>> I am not sure if below is expected behavior.
> >>> If we only select "hashtags", and it will return only 1 row ignoring
> the
> >>> null value.
> >>> However then if we try to get "hashtags.text", it fails...which means
> it
> >> is
> >>> still trying to read the NULL value.
> >>> I am thinking it may confuse the SQL developers.
> >>>
> >>>
> >>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | [{"text":"GoPatriots"}] |
> >>> +------------+
> >>> 1 row selected (0.109 seconds)
> >>>
> >>>
> >>> 0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json`
> t ;
> >>>
> >>> Query failed: Query failed: Failure while running fragment.,
> >>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast
> to
> >>> org.apache.drill.exec.vector.complex.MapVector [
> >>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>
> >>>
> >>> Error: exception while executing query: Failure while executing query.
> >>> (state=,code=0)
> >>>
> >>> Thanks,
> >>> Hao
> >>>
> >>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>> aengelbrecht@maprtech.com> wrote:
> >>>
> >>>> Now try on hashtags with the following:
> >>>>
> >>>> drilldemo:5181> select t.entities.hashtags.`text` from
> `/twitter.json` t
> >>>> where t.entities.hashtags is not null limit 10;
> >>>>
> >>>> Query failed: Query failed: Failure while running fragment.,
> >>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast
> to
> >>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>
> >>>>
> >>>> Error: exception while executing query: Failure while executing query.
> >>>> (state=,code=0)
> >>>>
> >>>>
> >>>> {
> >>>> "entities": {
> >>>>  "trends": [],
> >>>>  "symbols": [],
> >>>>  "urls": [],
> >>>>  "hashtags": [],
> >>>>  "user_mentions": []
> >>>> },
> >>>> "entities": {
> >>>>  "trends": [1,2,3],
> >>>>  "symbols": [4,5,6],
> >>>>  "urls": [7,8,9],
> >>>>  "hashtags": [
> >>>>    {
> >>>>      "text": "GoPatriots",
> >>>>      "indices": []
> >>>>    }
> >>>>  ],
> >>>>  "user_mentions": []
> >>>> }
> >>>> }
> >>>>
> >>>> The issue seems to be that if some records have arrays with maps in
> them
> >>>> and others are empty.
> >>>>
> >>>> —Andries
> >>>>
> >>>>
> >>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>>>
> >>>>> Seems it works for below json file:
> >>>>> {
> >>>>> "entities": {
> >>>>>  "trends": [],
> >>>>>  "symbols": [],
> >>>>>  "urls": [],
> >>>>>  "hashtags": [
> >>>>>    {
> >>>>>      "text": "GoPatriots",
> >>>>>      "indices": [
> >>>>>        83,
> >>>>>        94
> >>>>>      ]
> >>>>>    }
> >>>>>  ],
> >>>>>  "user_mentions": []
> >>>>> },
> >>>>> "entities": {
> >>>>>  "trends": [1,2,3],
> >>>>>  "symbols": [4,5,6],
> >>>>>  "urls": [7,8,9],
> >>>>>  "hashtags": [
> >>>>>    {
> >>>>>      "text": "GoPatriots",
> >>>>>      "indices": []
> >>>>>    }
> >>>>>  ],
> >>>>>  "user_mentions": []
> >>>>> }
> >>>>> }
> >>>>>
> >>>>>
> >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> where
> >>>>> t.entities.urls is not null;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | [7,8,9]    |
> >>>>> +------------+
> >>>>> 1 row selected (0.139 seconds)
> >>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> where
> >>>>> t.entities.urls is null;
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> +------------+
> >>>>> No rows selected (0.158 seconds)
> >>>>>
> >>>>> Thanks,
> >>>>> Hao
> >>>>>
> >>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> I believe that this works if the array contains homogeneous
> primitive
> >>>>>> types. In your example, it appears from the error, the array field
> >>>> 'member'
> >>>>>> contained maps for at least one record.
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cmatta@mapr.com
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>
> >>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>> Query failed: Query stopped., Failure while trying to materialize
> >>>>>> incoming schema.  Errors:
> >>>>>>>
> >>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> >>>> --UNKNOWN
> >>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>> bullseye-3:31010 ]
> >>>>>>>
> >>>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>> (state=,code=0)
> >>>>>>>
> >>>>>>> ​
> >>>>>>>
> >>>>>>> Chris Matta
> >>>>>>> cmatta@mapr.com
> >>>>>>> 215-701-3146
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>
> >>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>> i.e some records have "url":[]  and some will have an array with
> >> data
> >>>>>> in
> >>>>>>>>> it. When trying to read records with data in the array drill
> fails
> >>>> due
> >>>>>>>> to
> >>>>>>>>> records missing any data in the array. Trying a filter with/*
> where
> >>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is not
> >>>>>> null.
> >>>>>>>>>
> >>>>>>>>> Note some of the arrays contains maps, using twitter data as an
> >>>>>> example
> >>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
> >>>> others
> >>>>>>>> will
> >>>>>>>>> look similar to what is listed below.
> >>>>>>>>>
> >>>>>>>>> "entities": {
> >>>>>>>>>  "trends": [],
> >>>>>>>>>  "symbols": [],
> >>>>>>>>>  "urls": [],
> >>>>>>>>>  "hashtags": [
> >>>>>>>>>    {
> >>>>>>>>>      "text": "GoPatriots",
> >>>>>>>>>      "indices": [
> >>>>>>>>>        83,
> >>>>>>>>>        94
> >>>>>>>>>      ]
> >>>>>>>>>    }
> >>>>>>>>>  ],
> >>>>>>>>>  "user_mentions": []
> >>>>>>>>> },
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> —Andries
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
The sample data I posted only has 1 element, but some records have multiple elements in them.

Interestingly enough though
select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;

Produces
+------------+
|   EXPR$0   |
+------------+
| null       |
| SportsNews |
| null       |
| SportsNews |
| null       |
| null       |
| null       |
| null       |
| null       |
| CARvsSEA   |
+——————+


And

select t.entities.hashtags[0] from `twitter.json` t limit 10;

+------------+
|   EXPR$0   |
+------------+
| {"indices":[]} |
| {"text":"SportsNews","indices":[90,99]} |
| {"indices":[]} |
| {"text":"SportsNews","indices":[90,99]} |
| {"indices":[]} |
| {"indices":[]} |
| {"indices":[]} |
| {"indices":[]} |
| {"indices":[]} |
| {"text":"CARvsSEA","indices":[90,99]} |
+——————+

Strange part is that there is no indices map in the hashtags array, so no idea why it shows up when pointing to the first lament in an empty array.


—Andries






On Jan 21, 2015, at 3:24 PM, Hao Zhu <hz...@maprtech.com> wrote:

> I just noticed that the result of "hashtags" is just an array with only 1
> element.
> So take your example:
> [root@maprdemo tmp]# cat d.json
> {
> "entities": {
>   "trends": [],
>   "symbols": [],
>   "urls": [],
>   "hashtags": [],
>   "user_mentions": []
> },
> "entities": {
>   "trends": [1,2,3],
>   "symbols": [4,5,6],
>   "urls": [7,8,9],
>   "hashtags": [
>     {
>       "text": "GoPatriots",
>       "indices": []
>     }
>   ],
>   "user_mentions": []
> }
> }
> 
> Now we can do this to achieve the results:
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> +------------+
> |   EXPR$0   |
> +------------+
> | [{"text":"GoPatriots"}] |
> +------------+
> 1 row selected (0.09 seconds)
> 0: jdbc:drill:> select t.entities.hashtags[0].text from dfs.tmp.`d.json` t ;
> +------------+
> |   EXPR$0   |
> +------------+
> | GoPatriots |
> +------------+
> 1 row selected (0.108 seconds)
> 
> Thanks,
> Hao
> 
> 
> 
> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> When I run the query on a larger dataset it actually show the empty
>> records.
>> 
>> select t.entities.hashtags from `twitter.json` t limit 10;
>> 
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | []         |
>> | [{"text":"SportsNews","indices":[0,11]}] |
>> | []         |
>> | [{"text":"SportsNews","indices":[0,11]}] |
>> | []         |
>> | []         |
>> | []         |
>> | []         |
>> | []         |
>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>> +------------+
>> 10 rows selected (2.899 seconds)
>> 
>> 
>> 
>> However having the output as maps is not very useful, unless i can filter
>> out the records with empty arrays and then drill deeper into the ones with
>> data in the arrays.
>> 
>> BTW: Hao I would have expiated your query to return both rows, one with an
>> empty array as above and the other with the array data.
>> 
>> 
>> —Andries
>> 
>> 
>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>> 
>>> I am not sure if below is expected behavior.
>>> If we only select "hashtags", and it will return only 1 row ignoring the
>>> null value.
>>> However then if we try to get "hashtags.text", it fails...which means it
>> is
>>> still trying to read the NULL value.
>>> I am thinking it may confuse the SQL developers.
>>> 
>>> 
>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | [{"text":"GoPatriots"}] |
>>> +------------+
>>> 1 row selected (0.109 seconds)
>>> 
>>> 
>>> 0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json` t ;
>>> 
>>> Query failed: Query failed: Failure while running fragment.,
>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
>>> org.apache.drill.exec.vector.complex.MapVector [
>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>> 
>>> 
>>> Error: exception while executing query: Failure while executing query.
>>> (state=,code=0)
>>> 
>>> Thanks,
>>> Hao
>>> 
>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> Now try on hashtags with the following:
>>>> 
>>>> drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t
>>>> where t.entities.hashtags is not null limit 10;
>>>> 
>>>> Query failed: Query failed: Failure while running fragment.,
>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>> 
>>>> 
>>>> Error: exception while executing query: Failure while executing query.
>>>> (state=,code=0)
>>>> 
>>>> 
>>>> {
>>>> "entities": {
>>>>  "trends": [],
>>>>  "symbols": [],
>>>>  "urls": [],
>>>>  "hashtags": [],
>>>>  "user_mentions": []
>>>> },
>>>> "entities": {
>>>>  "trends": [1,2,3],
>>>>  "symbols": [4,5,6],
>>>>  "urls": [7,8,9],
>>>>  "hashtags": [
>>>>    {
>>>>      "text": "GoPatriots",
>>>>      "indices": []
>>>>    }
>>>>  ],
>>>>  "user_mentions": []
>>>> }
>>>> }
>>>> 
>>>> The issue seems to be that if some records have arrays with maps in them
>>>> and others are empty.
>>>> 
>>>> —Andries
>>>> 
>>>> 
>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>>>> 
>>>>> Seems it works for below json file:
>>>>> {
>>>>> "entities": {
>>>>>  "trends": [],
>>>>>  "symbols": [],
>>>>>  "urls": [],
>>>>>  "hashtags": [
>>>>>    {
>>>>>      "text": "GoPatriots",
>>>>>      "indices": [
>>>>>        83,
>>>>>        94
>>>>>      ]
>>>>>    }
>>>>>  ],
>>>>>  "user_mentions": []
>>>>> },
>>>>> "entities": {
>>>>>  "trends": [1,2,3],
>>>>>  "symbols": [4,5,6],
>>>>>  "urls": [7,8,9],
>>>>>  "hashtags": [
>>>>>    {
>>>>>      "text": "GoPatriots",
>>>>>      "indices": []
>>>>>    }
>>>>>  ],
>>>>>  "user_mentions": []
>>>>> }
>>>>> }
>>>>> 
>>>>> 
>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
>>>>> t.entities.urls is not null;
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | [7,8,9]    |
>>>>> +------------+
>>>>> 1 row selected (0.139 seconds)
>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
>>>>> t.entities.urls is null;
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> +------------+
>>>>> No rows selected (0.158 seconds)
>>>>> 
>>>>> Thanks,
>>>>> Hao
>>>>> 
>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
>> wrote:
>>>>> 
>>>>>> I believe that this works if the array contains homogeneous primitive
>>>>>> types. In your example, it appears from the error, the array field
>>>> 'member'
>>>>>> contained maps for at least one record.
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>> 
>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>> Query failed: Query stopped., Failure while trying to materialize
>>>>>> incoming schema.  Errors:
>>>>>>> 
>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>> --UNKNOWN
>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>> bullseye-3:31010 ]
>>>>>>> 
>>>>>>> Error: exception while executing query: Failure while executing
>> query.
>>>>>> (state=,code=0)
>>>>>>> 
>>>>>>> ​
>>>>>>> 
>>>>>>> Chris Matta
>>>>>>> cmatta@mapr.com
>>>>>>> 215-701-3146
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>> 
>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>>>> 
>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>> i.e some records have "url":[]  and some will have an array with
>> data
>>>>>> in
>>>>>>>>> it. When trying to read records with data in the array drill fails
>>>> due
>>>>>>>> to
>>>>>>>>> records missing any data in the array. Trying a filter with/* where
>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is not
>>>>>> null.
>>>>>>>>> 
>>>>>>>>> Note some of the arrays contains maps, using twitter data as an
>>>>>> example
>>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
>>>> others
>>>>>>>> will
>>>>>>>>> look similar to what is listed below.
>>>>>>>>> 
>>>>>>>>> "entities": {
>>>>>>>>>  "trends": [],
>>>>>>>>>  "symbols": [],
>>>>>>>>>  "urls": [],
>>>>>>>>>  "hashtags": [
>>>>>>>>>    {
>>>>>>>>>      "text": "GoPatriots",
>>>>>>>>>      "indices": [
>>>>>>>>>        83,
>>>>>>>>>        94
>>>>>>>>>      ]
>>>>>>>>>    }
>>>>>>>>>  ],
>>>>>>>>>  "user_mentions": []
>>>>>>>>> },
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> —Andries
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
I just noticed that the result of "hashtags" is just an array with only 1
element.
So take your example:
[root@maprdemo tmp]# cat d.json
{
"entities": {
   "trends": [],
   "symbols": [],
   "urls": [],
   "hashtags": [],
   "user_mentions": []
 },
"entities": {
   "trends": [1,2,3],
   "symbols": [4,5,6],
   "urls": [7,8,9],
   "hashtags": [
     {
       "text": "GoPatriots",
       "indices": []
     }
   ],
   "user_mentions": []
 }
}

Now we can do this to achieve the results:
0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
+------------+
|   EXPR$0   |
+------------+
| [{"text":"GoPatriots"}] |
+------------+
1 row selected (0.09 seconds)
0: jdbc:drill:> select t.entities.hashtags[0].text from dfs.tmp.`d.json` t ;
+------------+
|   EXPR$0   |
+------------+
| GoPatriots |
+------------+
1 row selected (0.108 seconds)

Thanks,
Hao



On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> When I run the query on a larger dataset it actually show the empty
> records.
>
> select t.entities.hashtags from `twitter.json` t limit 10;
>
> +------------+
> |   EXPR$0   |
> +------------+
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | []         |
> | []         |
> | []         |
> | []         |
> | [{"text":"CARvsSEA","indices":[36,45]}] |
> +------------+
> 10 rows selected (2.899 seconds)
>
>
>
> However having the output as maps is not very useful, unless i can filter
> out the records with empty arrays and then drill deeper into the ones with
> data in the arrays.
>
> BTW: Hao I would have expiated your query to return both rows, one with an
> empty array as above and the other with the array data.
>
>
> —Andries
>
>
> On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > I am not sure if below is expected behavior.
> > If we only select "hashtags", and it will return only 1 row ignoring the
> > null value.
> > However then if we try to get "hashtags.text", it fails...which means it
> is
> > still trying to read the NULL value.
> > I am thinking it may confuse the SQL developers.
> >
> >
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [{"text":"GoPatriots"}] |
> > +------------+
> > 1 row selected (0.109 seconds)
> >
> >
> > 0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json` t ;
> >
> > Query failed: Query failed: Failure while running fragment.,
> > org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> > org.apache.drill.exec.vector.complex.MapVector [
> > 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> > [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >
> >
> > Error: exception while executing query: Failure while executing query.
> > (state=,code=0)
> >
> > Thanks,
> > Hao
> >
> > On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> Now try on hashtags with the following:
> >>
> >> drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t
> >> where t.entities.hashtags is not null limit 10;
> >>
> >> Query failed: Query failed: Failure while running fragment.,
> >> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> >> org.apache.drill.exec.vector.complex.MapVector [
> >> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>
> >>
> >> Error: exception while executing query: Failure while executing query.
> >> (state=,code=0)
> >>
> >>
> >> {
> >> "entities": {
> >>   "trends": [],
> >>   "symbols": [],
> >>   "urls": [],
> >>   "hashtags": [],
> >>   "user_mentions": []
> >> },
> >> "entities": {
> >>   "trends": [1,2,3],
> >>   "symbols": [4,5,6],
> >>   "urls": [7,8,9],
> >>   "hashtags": [
> >>     {
> >>       "text": "GoPatriots",
> >>       "indices": []
> >>     }
> >>   ],
> >>   "user_mentions": []
> >> }
> >> }
> >>
> >> The issue seems to be that if some records have arrays with maps in them
> >> and others are empty.
> >>
> >> —Andries
> >>
> >>
> >> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
> >>
> >>> Seems it works for below json file:
> >>> {
> >>> "entities": {
> >>>   "trends": [],
> >>>   "symbols": [],
> >>>   "urls": [],
> >>>   "hashtags": [
> >>>     {
> >>>       "text": "GoPatriots",
> >>>       "indices": [
> >>>         83,
> >>>         94
> >>>       ]
> >>>     }
> >>>   ],
> >>>   "user_mentions": []
> >>> },
> >>> "entities": {
> >>>   "trends": [1,2,3],
> >>>   "symbols": [4,5,6],
> >>>   "urls": [7,8,9],
> >>>   "hashtags": [
> >>>     {
> >>>       "text": "GoPatriots",
> >>>       "indices": []
> >>>     }
> >>>   ],
> >>>   "user_mentions": []
> >>> }
> >>> }
> >>>
> >>>
> >>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> >>> t.entities.urls is not null;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | [7,8,9]    |
> >>> +------------+
> >>> 1 row selected (0.139 seconds)
> >>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> >>> t.entities.urls is null;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> +------------+
> >>> No rows selected (0.158 seconds)
> >>>
> >>> Thanks,
> >>> Hao
> >>>
> >>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com>
> wrote:
> >>>
> >>>> I believe that this works if the array contains homogeneous primitive
> >>>> types. In your example, it appears from the error, the array field
> >> 'member'
> >>>> contained maps for at least one record.
> >>>>
> >>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
> >>>> wrote:
> >>>>
> >>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>
> >>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>> Query failed: Query stopped., Failure while trying to materialize
> >>>> incoming schema.  Errors:
> >>>>>
> >>>>> Error in expression at index -1.  Error: Missing function
> >>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> >> --UNKNOWN
> >>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >> bullseye-3:31010 ]
> >>>>>
> >>>>> Error: exception while executing query: Failure while executing
> query.
> >>>> (state=,code=0)
> >>>>>
> >>>>> ​
> >>>>>
> >>>>> Chris Matta
> >>>>> cmatta@mapr.com
> >>>>> 215-701-3146
> >>>>>
> >>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> repeated_count('entities.urls') > 0
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>> aengelbrecht@maprtech.com> wrote:
> >>>>>>
> >>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>> i.e some records have "url":[]  and some will have an array with
> data
> >>>> in
> >>>>>>> it. When trying to read records with data in the array drill fails
> >> due
> >>>>>> to
> >>>>>>> records missing any data in the array. Trying a filter with/* where
> >>>>>>> "url":[0] is not null */ fails, also fails if applying url is not
> >>>> null.
> >>>>>>>
> >>>>>>> Note some of the arrays contains maps, using twitter data as an
> >>>> example
> >>>>>>> below. Some records have an empty array with “hashtags”:[]  and
> >> others
> >>>>>> will
> >>>>>>> look similar to what is listed below.
> >>>>>>>
> >>>>>>> "entities": {
> >>>>>>>   "trends": [],
> >>>>>>>   "symbols": [],
> >>>>>>>   "urls": [],
> >>>>>>>   "hashtags": [
> >>>>>>>     {
> >>>>>>>       "text": "GoPatriots",
> >>>>>>>       "indices": [
> >>>>>>>         83,
> >>>>>>>         94
> >>>>>>>       ]
> >>>>>>>     }
> >>>>>>>   ],
> >>>>>>>   "user_mentions": []
> >>>>>>> },
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> —Andries
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
When I run the query on a larger dataset it actually show the empty records.

select t.entities.hashtags from `twitter.json` t limit 10;

+------------+
|   EXPR$0   |
+------------+
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| []         |
| []         |
| []         |
| []         |
| [{"text":"CARvsSEA","indices":[36,45]}] |
+------------+
10 rows selected (2.899 seconds)



However having the output as maps is not very useful, unless i can filter out the records with empty arrays and then drill deeper into the ones with data in the arrays.

BTW: Hao I would have expiated your query to return both rows, one with an empty array as above and the other with the array data.


—Andries


On Jan 21, 2015, at 2:56 PM, Hao Zhu <hz...@maprtech.com> wrote:

> I am not sure if below is expected behavior.
> If we only select "hashtags", and it will return only 1 row ignoring the
> null value.
> However then if we try to get "hashtags.text", it fails...which means it is
> still trying to read the NULL value.
> I am thinking it may confuse the SQL developers.
> 
> 
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> +------------+
> |   EXPR$0   |
> +------------+
> | [{"text":"GoPatriots"}] |
> +------------+
> 1 row selected (0.109 seconds)
> 
> 
> 0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json` t ;
> 
> Query failed: Query failed: Failure while running fragment.,
> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> org.apache.drill.exec.vector.complex.MapVector [
> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> 
> 
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> 
> Thanks,
> Hao
> 
> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Now try on hashtags with the following:
>> 
>> drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t
>> where t.entities.hashtags is not null limit 10;
>> 
>> Query failed: Query failed: Failure while running fragment.,
>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
>> org.apache.drill.exec.vector.complex.MapVector [
>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>> 
>> 
>> Error: exception while executing query: Failure while executing query.
>> (state=,code=0)
>> 
>> 
>> {
>> "entities": {
>>   "trends": [],
>>   "symbols": [],
>>   "urls": [],
>>   "hashtags": [],
>>   "user_mentions": []
>> },
>> "entities": {
>>   "trends": [1,2,3],
>>   "symbols": [4,5,6],
>>   "urls": [7,8,9],
>>   "hashtags": [
>>     {
>>       "text": "GoPatriots",
>>       "indices": []
>>     }
>>   ],
>>   "user_mentions": []
>> }
>> }
>> 
>> The issue seems to be that if some records have arrays with maps in them
>> and others are empty.
>> 
>> —Andries
>> 
>> 
>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>> 
>>> Seems it works for below json file:
>>> {
>>> "entities": {
>>>   "trends": [],
>>>   "symbols": [],
>>>   "urls": [],
>>>   "hashtags": [
>>>     {
>>>       "text": "GoPatriots",
>>>       "indices": [
>>>         83,
>>>         94
>>>       ]
>>>     }
>>>   ],
>>>   "user_mentions": []
>>> },
>>> "entities": {
>>>   "trends": [1,2,3],
>>>   "symbols": [4,5,6],
>>>   "urls": [7,8,9],
>>>   "hashtags": [
>>>     {
>>>       "text": "GoPatriots",
>>>       "indices": []
>>>     }
>>>   ],
>>>   "user_mentions": []
>>> }
>>> }
>>> 
>>> 
>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
>>> t.entities.urls is not null;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | [7,8,9]    |
>>> +------------+
>>> 1 row selected (0.139 seconds)
>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
>>> t.entities.urls is null;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> +------------+
>>> No rows selected (0.158 seconds)
>>> 
>>> Thanks,
>>> Hao
>>> 
>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com> wrote:
>>> 
>>>> I believe that this works if the array contains homogeneous primitive
>>>> types. In your example, it appears from the error, the array field
>> 'member'
>>>> contained maps for at least one record.
>>>> 
>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
>>>> wrote:
>>>> 
>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>> 
>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>> Query failed: Query stopped., Failure while trying to materialize
>>>> incoming schema.  Errors:
>>>>> 
>>>>> Error in expression at index -1.  Error: Missing function
>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>> --UNKNOWN
>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>> bullseye-3:31010 ]
>>>>> 
>>>>> Error: exception while executing query: Failure while executing query.
>>>> (state=,code=0)
>>>>> 
>>>>> ​
>>>>> 
>>>>> Chris Matta
>>>>> cmatta@mapr.com
>>>>> 215-701-3146
>>>>> 
>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
>> wrote:
>>>>> 
>>>>>> repeated_count('entities.urls') > 0
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>> aengelbrecht@maprtech.com> wrote:
>>>>>> 
>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>> i.e some records have "url":[]  and some will have an array with data
>>>> in
>>>>>>> it. When trying to read records with data in the array drill fails
>> due
>>>>>> to
>>>>>>> records missing any data in the array. Trying a filter with/* where
>>>>>>> "url":[0] is not null */ fails, also fails if applying url is not
>>>> null.
>>>>>>> 
>>>>>>> Note some of the arrays contains maps, using twitter data as an
>>>> example
>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
>> others
>>>>>> will
>>>>>>> look similar to what is listed below.
>>>>>>> 
>>>>>>> "entities": {
>>>>>>>   "trends": [],
>>>>>>>   "symbols": [],
>>>>>>>   "urls": [],
>>>>>>>   "hashtags": [
>>>>>>>     {
>>>>>>>       "text": "GoPatriots",
>>>>>>>       "indices": [
>>>>>>>         83,
>>>>>>>         94
>>>>>>>       ]
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   "user_mentions": []
>>>>>>> },
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> —Andries
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
I am not sure if below is expected behavior.
If we only select "hashtags", and it will return only 1 row ignoring the
null value.
However then if we try to get "hashtags.text", it fails...which means it is
still trying to read the NULL value.
I am thinking it may confuse the SQL developers.


0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
+------------+
|   EXPR$0   |
+------------+
| [{"text":"GoPatriots"}] |
+------------+
1 row selected (0.109 seconds)


0: jdbc:drill:> select t.entities.hashtags.text from dfs.tmp.`d.json` t ;

Query failed: Query failed: Failure while running fragment.,
org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
org.apache.drill.exec.vector.complex.MapVector [
7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
[ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]


Error: exception while executing query: Failure while executing query.
(state=,code=0)

Thanks,
Hao

On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Now try on hashtags with the following:
>
> drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t
> where t.entities.hashtags is not null limit 10;
>
> Query failed: Query failed: Failure while running fragment.,
> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to
> org.apache.drill.exec.vector.complex.MapVector [
> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>
>
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
>
>
> {
> "entities": {
>    "trends": [],
>    "symbols": [],
>    "urls": [],
>    "hashtags": [],
>    "user_mentions": []
>  },
> "entities": {
>    "trends": [1,2,3],
>    "symbols": [4,5,6],
>    "urls": [7,8,9],
>    "hashtags": [
>      {
>        "text": "GoPatriots",
>        "indices": []
>      }
>    ],
>    "user_mentions": []
>  }
> }
>
> The issue seems to be that if some records have arrays with maps in them
> and others are empty.
>
> —Andries
>
>
> On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:
>
> > Seems it works for below json file:
> > {
> > "entities": {
> >    "trends": [],
> >    "symbols": [],
> >    "urls": [],
> >    "hashtags": [
> >      {
> >        "text": "GoPatriots",
> >        "indices": [
> >          83,
> >          94
> >        ]
> >      }
> >    ],
> >    "user_mentions": []
> >  },
> > "entities": {
> >    "trends": [1,2,3],
> >    "symbols": [4,5,6],
> >    "urls": [7,8,9],
> >    "hashtags": [
> >      {
> >        "text": "GoPatriots",
> >        "indices": []
> >      }
> >    ],
> >    "user_mentions": []
> >  }
> > }
> >
> >
> > 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> > t.entities.urls is not null;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [7,8,9]    |
> > +------------+
> > 1 row selected (0.139 seconds)
> > 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> > t.entities.urls is null;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > +------------+
> > No rows selected (0.158 seconds)
> >
> > Thanks,
> > Hao
> >
> > On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com> wrote:
> >
> >> I believe that this works if the array contains homogeneous primitive
> >> types. In your example, it appears from the error, the array field
> 'member'
> >> contained maps for at least one record.
> >>
> >> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
> >> wrote:
> >>
> >>> Trying that locally did not work for me (drill 0.7.0):
> >>>
> >>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>> Query failed: Query stopped., Failure while trying to materialize
> >> incoming schema.  Errors:
> >>>
> >>> Error in expression at index -1.  Error: Missing function
> >> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> --UNKNOWN
> >> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> bullseye-3:31010 ]
> >>>
> >>> Error: exception while executing query: Failure while executing query.
> >> (state=,code=0)
> >>>
> >>> ​
> >>>
> >>> Chris Matta
> >>> cmatta@mapr.com
> >>> 215-701-3146
> >>>
> >>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com>
> wrote:
> >>>
> >>>> repeated_count('entities.urls') > 0
> >>>>
> >>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>> aengelbrecht@maprtech.com> wrote:
> >>>>
> >>>>> How do you filter out records with an empty array in drill?
> >>>>> i.e some records have "url":[]  and some will have an array with data
> >> in
> >>>>> it. When trying to read records with data in the array drill fails
> due
> >>>> to
> >>>>> records missing any data in the array. Trying a filter with/* where
> >>>>> "url":[0] is not null */ fails, also fails if applying url is not
> >> null.
> >>>>>
> >>>>> Note some of the arrays contains maps, using twitter data as an
> >> example
> >>>>> below. Some records have an empty array with “hashtags”:[]  and
> others
> >>>> will
> >>>>> look similar to what is listed below.
> >>>>>
> >>>>> "entities": {
> >>>>>    "trends": [],
> >>>>>    "symbols": [],
> >>>>>    "urls": [],
> >>>>>    "hashtags": [
> >>>>>      {
> >>>>>        "text": "GoPatriots",
> >>>>>        "indices": [
> >>>>>          83,
> >>>>>          94
> >>>>>        ]
> >>>>>      }
> >>>>>    ],
> >>>>>    "user_mentions": []
> >>>>>  },
> >>>>>
> >>>>>
> >>>>> Thanks
> >>>>> —Andries
> >>>>
> >>>
> >>>
> >>
>
>

Re: Filter out empty arrays in JSON

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Now try on hashtags with the following:

drilldemo:5181> select t.entities.hashtags.`text` from `/twitter.json` t where t.entities.hashtags is not null limit 10;

Query failed: Query failed: Failure while running fragment., org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be cast to org.apache.drill.exec.vector.complex.MapVector [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
[ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]


Error: exception while executing query: Failure while executing query. (state=,code=0)


{
"entities": {
   "trends": [],
   "symbols": [],
   "urls": [],
   "hashtags": [],
   "user_mentions": []
 },
"entities": {
   "trends": [1,2,3],
   "symbols": [4,5,6],
   "urls": [7,8,9],
   "hashtags": [
     {
       "text": "GoPatriots",
       "indices": []
     }
   ],
   "user_mentions": []
 }
}

The issue seems to be that if some records have arrays with maps in them and others are empty.

—Andries


On Jan 21, 2015, at 2:34 PM, Hao Zhu <hz...@maprtech.com> wrote:

> Seems it works for below json file:
> {
> "entities": {
>    "trends": [],
>    "symbols": [],
>    "urls": [],
>    "hashtags": [
>      {
>        "text": "GoPatriots",
>        "indices": [
>          83,
>          94
>        ]
>      }
>    ],
>    "user_mentions": []
>  },
> "entities": {
>    "trends": [1,2,3],
>    "symbols": [4,5,6],
>    "urls": [7,8,9],
>    "hashtags": [
>      {
>        "text": "GoPatriots",
>        "indices": []
>      }
>    ],
>    "user_mentions": []
>  }
> }
> 
> 
> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> t.entities.urls is not null;
> +------------+
> |   EXPR$0   |
> +------------+
> | [7,8,9]    |
> +------------+
> 1 row selected (0.139 seconds)
> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
> t.entities.urls is null;
> +------------+
> |   EXPR$0   |
> +------------+
> +------------+
> No rows selected (0.158 seconds)
> 
> Thanks,
> Hao
> 
> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com> wrote:
> 
>> I believe that this works if the array contains homogeneous primitive
>> types. In your example, it appears from the error, the array field 'member'
>> contained maps for at least one record.
>> 
>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
>> wrote:
>> 
>>> Trying that locally did not work for me (drill 0.7.0):
>>> 
>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>> Query failed: Query stopped., Failure while trying to materialize
>> incoming schema.  Errors:
>>> 
>>> Error in expression at index -1.  Error: Missing function
>> implementation: [repeated_count(MAP-REPEATED)].  Full expression: --UNKNOWN
>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on bullseye-3:31010 ]
>>> 
>>> Error: exception while executing query: Failure while executing query.
>> (state=,code=0)
>>> 
>>> ​
>>> 
>>> Chris Matta
>>> cmatta@mapr.com
>>> 215-701-3146
>>> 
>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:
>>> 
>>>> repeated_count('entities.urls') > 0
>>>> 
>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> How do you filter out records with an empty array in drill?
>>>>> i.e some records have "url":[]  and some will have an array with data
>> in
>>>>> it. When trying to read records with data in the array drill fails due
>>>> to
>>>>> records missing any data in the array. Trying a filter with/* where
>>>>> "url":[0] is not null */ fails, also fails if applying url is not
>> null.
>>>>> 
>>>>> Note some of the arrays contains maps, using twitter data as an
>> example
>>>>> below. Some records have an empty array with “hashtags”:[]  and others
>>>> will
>>>>> look similar to what is listed below.
>>>>> 
>>>>> "entities": {
>>>>>    "trends": [],
>>>>>    "symbols": [],
>>>>>    "urls": [],
>>>>>    "hashtags": [
>>>>>      {
>>>>>        "text": "GoPatriots",
>>>>>        "indices": [
>>>>>          83,
>>>>>          94
>>>>>        ]
>>>>>      }
>>>>>    ],
>>>>>    "user_mentions": []
>>>>>  },
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>> 
>>> 
>>> 
>> 


Re: Filter out empty arrays in JSON

Posted by Hao Zhu <hz...@maprtech.com>.
Seems it works for below json file:
{
"entities": {
    "trends": [],
    "symbols": [],
    "urls": [],
    "hashtags": [
      {
        "text": "GoPatriots",
        "indices": [
          83,
          94
        ]
      }
    ],
    "user_mentions": []
  },
 "entities": {
    "trends": [1,2,3],
    "symbols": [4,5,6],
    "urls": [7,8,9],
    "hashtags": [
      {
        "text": "GoPatriots",
        "indices": []
      }
    ],
    "user_mentions": []
  }
 }


0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
t.entities.urls is not null;
+------------+
|   EXPR$0   |
+------------+
| [7,8,9]    |
+------------+
1 row selected (0.139 seconds)
0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t where
t.entities.urls is null;
+------------+
|   EXPR$0   |
+------------+
+------------+
No rows selected (0.158 seconds)

Thanks,
Hao

On Wed, Jan 21, 2015 at 2:01 PM, Aditya <ad...@gmail.com> wrote:

> I believe that this works if the array contains homogeneous primitive
> types. In your example, it appears from the error, the array field 'member'
> contained maps for at least one record.
>
> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com>
> wrote:
>
> > Trying that locally did not work for me (drill 0.7.0):
> >
> > 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> `Downloads/test.json` where repeated_count(`members`) > 0;
> > Query failed: Query stopped., Failure while trying to materialize
> incoming schema.  Errors:
> >
> > Error in expression at index -1.  Error: Missing function
> implementation: [repeated_count(MAP-REPEATED)].  Full expression: --UNKNOWN
> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on bullseye-3:31010 ]
> >
> > Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> >
> > ​
> >
> > Chris Matta
> > cmatta@mapr.com
> > 215-701-3146
> >
> > On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:
> >
> >> repeated_count('entities.urls') > 0
> >>
> >> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >> aengelbrecht@maprtech.com> wrote:
> >>
> >> > How do you filter out records with an empty array in drill?
> >> > i.e some records have "url":[]  and some will have an array with data
> in
> >> > it. When trying to read records with data in the array drill fails due
> >> to
> >> > records missing any data in the array. Trying a filter with/* where
> >> > "url":[0] is not null */ fails, also fails if applying url is not
> null.
> >> >
> >> > Note some of the arrays contains maps, using twitter data as an
> example
> >> > below. Some records have an empty array with “hashtags”:[]  and others
> >> will
> >> > look similar to what is listed below.
> >> >
> >> > "entities": {
> >> >     "trends": [],
> >> >     "symbols": [],
> >> >     "urls": [],
> >> >     "hashtags": [
> >> >       {
> >> >         "text": "GoPatriots",
> >> >         "indices": [
> >> >           83,
> >> >           94
> >> >         ]
> >> >       }
> >> >     ],
> >> >     "user_mentions": []
> >> >   },
> >> >
> >> >
> >> > Thanks
> >> > —Andries
> >>
> >
> >
>

Re: Filter out empty arrays in JSON

Posted by Aditya <ad...@gmail.com>.
I believe that this works if the array contains homogeneous primitive
types. In your example, it appears from the error, the array field 'member'
contained maps for at least one record.

On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <cm...@mapr.com> wrote:

> Trying that locally did not work for me (drill 0.7.0):
>
> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from `Downloads/test.json` where repeated_count(`members`) > 0;
> Query failed: Query stopped., Failure while trying to materialize incoming schema.  Errors:
>
> Error in expression at index -1.  Error: Missing function implementation: [repeated_count(MAP-REPEATED)].  Full expression: --UNKNOWN EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on bullseye-3:31010 ]
>
> Error: exception while executing query: Failure while executing query. (state=,code=0)
>
> ​
>
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
>
> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:
>
>> repeated_count('entities.urls') > 0
>>
>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>> aengelbrecht@maprtech.com> wrote:
>>
>> > How do you filter out records with an empty array in drill?
>> > i.e some records have "url":[]  and some will have an array with data in
>> > it. When trying to read records with data in the array drill fails due
>> to
>> > records missing any data in the array. Trying a filter with/* where
>> > "url":[0] is not null */ fails, also fails if applying url is not null.
>> >
>> > Note some of the arrays contains maps, using twitter data as an example
>> > below. Some records have an empty array with “hashtags”:[]  and others
>> will
>> > look similar to what is listed below.
>> >
>> > "entities": {
>> >     "trends": [],
>> >     "symbols": [],
>> >     "urls": [],
>> >     "hashtags": [
>> >       {
>> >         "text": "GoPatriots",
>> >         "indices": [
>> >           83,
>> >           94
>> >         ]
>> >       }
>> >     ],
>> >     "user_mentions": []
>> >   },
>> >
>> >
>> > Thanks
>> > —Andries
>>
>
>

Re: Filter out empty arrays in JSON

Posted by Christopher Matta <cm...@mapr.com>.
Trying that locally did not work for me (drill 0.7.0):

0: jdbc:drill:zk=local> select `id`, `name`, `members` from
`Downloads/test.json` where repeated_count(`members`) > 0;
Query failed: Query stopped., Failure while trying to materialize
incoming schema.  Errors:

Error in expression at index -1.  Error: Missing function
implementation: [repeated_count(MAP-REPEATED)].  Full expression:
--UNKNOWN EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
bullseye-3:31010 ]

Error: exception while executing query: Failure while executing query.
(state=,code=0)

​

Chris Matta
cmatta@mapr.com
215-701-3146

On Wed, Jan 21, 2015 at 4:50 PM, Aditya <ad...@gmail.com> wrote:

> repeated_count('entities.urls') > 0
>
> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > How do you filter out records with an empty array in drill?
> > i.e some records have "url":[]  and some will have an array with data in
> > it. When trying to read records with data in the array drill fails due to
> > records missing any data in the array. Trying a filter with/* where
> > "url":[0] is not null */ fails, also fails if applying url is not null.
> >
> > Note some of the arrays contains maps, using twitter data as an example
> > below. Some records have an empty array with “hashtags”:[]  and others
> will
> > look similar to what is listed below.
> >
> > "entities": {
> >     "trends": [],
> >     "symbols": [],
> >     "urls": [],
> >     "hashtags": [
> >       {
> >         "text": "GoPatriots",
> >         "indices": [
> >           83,
> >           94
> >         ]
> >       }
> >     ],
> >     "user_mentions": []
> >   },
> >
> >
> > Thanks
> > —Andries
>

Re: Filter out empty arrays in JSON

Posted by Aditya <ad...@gmail.com>.
repeated_count('entities.urls') > 0

On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> How do you filter out records with an empty array in drill?
> i.e some records have "url":[]  and some will have an array with data in
> it. When trying to read records with data in the array drill fails due to
> records missing any data in the array. Trying a filter with/* where
> "url":[0] is not null */ fails, also fails if applying url is not null.
>
> Note some of the arrays contains maps, using twitter data as an example
> below. Some records have an empty array with “hashtags”:[]  and others will
> look similar to what is listed below.
>
> "entities": {
>     "trends": [],
>     "symbols": [],
>     "urls": [],
>     "hashtags": [
>       {
>         "text": "GoPatriots",
>         "indices": [
>           83,
>           94
>         ]
>       }
>     ],
>     "user_mentions": []
>   },
>
>
> Thanks
> —Andries