You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by "Lee, David" <Da...@blackrock.com> on 2017/06/28 16:51:05 UTC

Drill Query Engine for nested json or parquet


Is there a fundamental difference between the following queries? I can't get the second example working with parquet files which contain 400,000+ nested records..

It seems like the system wants to flatten every possible record before applying the SQL Where clause to the flattened data structure..

Example 1:

select b.* from dfs.`test1.json` b where b.item  = 3

[
    {
      "item": 1,
      "item_name": "name_for_1"
    },
    {
      "item": 2,
      "item_name": "name_for_2"
    },
    {
      "item": 3,
      "item_name": "name_for_3"
    },
    {
      "item": 4,
      "item_name": "name_for_4"
    }
]

Example 2:

select b.* from 
(select flatten(a.details) as details
from dfs.`test2.json` a) b
where b.details.item  = 3

{
  "header": "my_header_info",
  "details": [
    {
      "item": 1,
      "item_name": "name_for_1"
    },
    {
      "item": 2,
      "item_name": "name_for_2"
    },
    {
      "item": 3,
      "item_name": "name_for_3"
    },
    {
      "item": 4,
      "item_name": "name_for_4"
    }
  ]
}




This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/en-us/compliance/email-disclaimers for further information.  Please refer to http://www.blackrock.com/corporate/en-us/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.

For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.

© 2017 BlackRock, Inc. All rights reserved.

RE: Drill Query Engine for nested json or parquet

Posted by "Lee, David" <Da...@blackrock.com>.

Unfortunately the files A) contain licensed data B) are 403,208,254 bytes of gzipped JSON per file which contain roughly 6,000 JSON records with a mix of ~4,000 nested arrays and objects (up to 7 levels deep).

Running Drill SQL on a sample of individual JSON keys works so it's probably some particular JSON key value out of the ~4,000 which is causing the IOBE. The files which are failing are from Africa, Asia and Australia so I have a feeling it's some regional utf8 / Unicode text values which is the problem.

-----Original Message-----
From: Jinfeng Ni [mailto:jni@apache.org] 
Sent: Monday, July 03, 2017 3:39 PM
To: user <us...@drill.apache.org>
Subject: Re: Drill Query Engine for nested json or parquet

This IOBE is clearly indicates a code bug. From the error message, it seems to be similar to DRILL-1743, which was marked as fixed. Could you please open a JIRA, and post the query / sample data you used (if it does not contain sensitive info)? Having a reproduce for this would make it easier for people to take a look. Thanks.


1. https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D1743&d=DwIFaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=IZ97vmUqJvkHR_Xi9GS-CmRIrqr1MSe6379wL6zbZbM&s=A-_qhmjmgndvzahLAAdfUZDxUHNEft27SgzsC5OPC_E&e= 

On Mon, Jul 3, 2017 at 10:07 AM, Lee, David <Da...@blackrock.com> wrote:

> Ok thanks, but now I got a new odd problem which I haven't been able 
> to debug..
>
> SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4
> (expected: range(0, 16384))
>
> After unwrapping the JSON records from a nested dictionary, 90% of my 
> JSON files convert successfully to parquet, but 10% are getting the error above..
>
> I've tried to narrow down which rows in my JSON file are problematic, 
> but it is both consistent and random.. Removing 1000s records here and 
> there fixes the file, but it must be a combination of multiple records 
> causing the error. This error never showed up when I converted the 
> wrapped JSON to parquet..
>
> Any idea how to get to the root of this one?
>
> -----Original Message-----
> From: Jinfeng Ni [mailto:jni@apache.org]
> Sent: Wednesday, June 28, 2017 11:42 AM
> To: user <us...@drill.apache.org>
> Subject: Re: Drill Query Engine for nested json or parquet
>
> You are right that Q2 has to first apply flatten the field "details"
> before applying filter on top of the flattened rows.  I do not think 
> Drill currently supports to push filter down.
>
> The difference between Q1 and Q2:  Q1 is querying a 4-row table, while 
> Q2 is querying a 1-row table; the flatten operator produces 4 rows on 
> top of which the filter is applied.
>
>
> On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <Da...@blackrock.com>
> wrote:
>
> >
> > Is there a fundamental difference between the following queries? I 
> > can't get the second example working with parquet files which 
> > contain 400,000+ nested records..
> >
> > It seems like the system wants to flatten every possible record 
> > before applying the SQL Where clause to the flattened data structure..
> >
> > Example 1:
> >
> > select b.* from dfs.`test1.json` b where b.item  = 3
> >
> > [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> > ]
> >
> > Example 2:
> >
> > select b.* from
> > (select flatten(a.details) as details from dfs.`test2.json` a) b 
> > where b.details.item  = 3
> >
> > {
> >   "header": "my_header_info",
> >   "details": [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> >   ]
> > }
> >
> >
> >
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender 
> > immediately and delete this message. See http://www.blackrock.com/ 
> > corporate/en-us/compliance/email-disclaimers for further information.
> > Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> > privacy-policy for more information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see 
> > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> >
> > © 2017 BlackRock, Inc. All rights reserved.
> >
>

RE: Drill Query Engine for nested json or parquet

Posted by "Lee, David" <Da...@blackrock.com>.

Ok I ran into the same problem with another set of files..

Opened a ticket and attached test files

https://issues.apache.org/jira/browse/DRILL-5769


-----Original Message-----
From: Lee, David 
Sent: Wednesday, July 05, 2017 5:49 PM
To: user <us...@drill.apache.org>
Subject: RE: Drill Query Engine for nested json or parquet

Unfortunately the files A) contain licensed data B) are 403,208,254 bytes of gzipped JSON per file which contain roughly 6,000 JSON records with a mix of ~4,000 nested arrays and objects (up to 7 levels deep).

Running Drill SQL on a sample of individual JSON keys works so it's probably some particular JSON key value out of the ~4,000 which is causing the IOBE. The files which are failing are from Africa, Asia and Australia so I have a feeling it's some regional utf8 / Unicode text values which is the problem.

-----Original Message-----
From: Jinfeng Ni [mailto:jni@apache.org]
Sent: Monday, July 03, 2017 3:39 PM
To: user <us...@drill.apache.org>
Subject: Re: Drill Query Engine for nested json or parquet

This IOBE is clearly indicates a code bug. From the error message, it seems to be similar to DRILL-1743, which was marked as fixed. Could you please open a JIRA, and post the query / sample data you used (if it does not contain sensitive info)? Having a reproduce for this would make it easier for people to take a look. Thanks.


1. https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D1743&d=DwIFaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=IZ97vmUqJvkHR_Xi9GS-CmRIrqr1MSe6379wL6zbZbM&s=A-_qhmjmgndvzahLAAdfUZDxUHNEft27SgzsC5OPC_E&e= 

On Mon, Jul 3, 2017 at 10:07 AM, Lee, David <Da...@blackrock.com> wrote:

> Ok thanks, but now I got a new odd problem which I haven't been able 
> to debug..
>
> SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4
> (expected: range(0, 16384))
>
> After unwrapping the JSON records from a nested dictionary, 90% of my 
> JSON files convert successfully to parquet, but 10% are getting the error above..
>
> I've tried to narrow down which rows in my JSON file are problematic, 
> but it is both consistent and random.. Removing 1000s records here and 
> there fixes the file, but it must be a combination of multiple records 
> causing the error. This error never showed up when I converted the 
> wrapped JSON to parquet..
>
> Any idea how to get to the root of this one?
>
> -----Original Message-----
> From: Jinfeng Ni [mailto:jni@apache.org]
> Sent: Wednesday, June 28, 2017 11:42 AM
> To: user <us...@drill.apache.org>
> Subject: Re: Drill Query Engine for nested json or parquet
>
> You are right that Q2 has to first apply flatten the field "details"
> before applying filter on top of the flattened rows.  I do not think 
> Drill currently supports to push filter down.
>
> The difference between Q1 and Q2:  Q1 is querying a 4-row table, while
> Q2 is querying a 1-row table; the flatten operator produces 4 rows on 
> top of which the filter is applied.
>
>
> On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <Da...@blackrock.com>
> wrote:
>
> >
> > Is there a fundamental difference between the following queries? I 
> > can't get the second example working with parquet files which 
> > contain 400,000+ nested records..
> >
> > It seems like the system wants to flatten every possible record 
> > before applying the SQL Where clause to the flattened data structure..
> >
> > Example 1:
> >
> > select b.* from dfs.`test1.json` b where b.item  = 3
> >
> > [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> > ]
> >
> > Example 2:
> >
> > select b.* from
> > (select flatten(a.details) as details from dfs.`test2.json` a) b 
> > where b.details.item  = 3
> >
> > {
> >   "header": "my_header_info",
> >   "details": [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> >   ]
> > }
> >
> >
> >
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender 
> > immediately and delete this message. See http://www.blackrock.com/ 
> > corporate/en-us/compliance/email-disclaimers for further information.
> > Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> > privacy-policy for more information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see 
> > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> >
> > © 2017 BlackRock, Inc. All rights reserved.
> >
>

Re: Drill Query Engine for nested json or parquet

Posted by Jinfeng Ni <jn...@apache.org>.

This IOBE is clearly indicates a code bug. From the error message, it seems
to be similar to DRILL-1743, which was marked as fixed. Could you please
open a JIRA, and post the query / sample data you used (if it does not
contain sensitive info)? Having a reproduce for this would make it easier
for people to take a look. Thanks.


1. https://issues.apache.org/jira/browse/DRILL-1743

On Mon, Jul 3, 2017 at 10:07 AM, Lee, David <Da...@blackrock.com> wrote:

> Ok thanks, but now I got a new odd problem which I haven't been able to
> debug..
>
> SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4
> (expected: range(0, 16384))
>
> After unwrapping the JSON records from a nested dictionary, 90% of my JSON
> files convert successfully to parquet, but 10% are getting the error above..
>
> I've tried to narrow down which rows in my JSON file are problematic, but
> it is both consistent and random.. Removing 1000s records here and there
> fixes the file, but it must be a combination of multiple records causing
> the error. This error never showed up when I converted the wrapped JSON to
> parquet..
>
> Any idea how to get to the root of this one?
>
> -----Original Message-----
> From: Jinfeng Ni [mailto:jni@apache.org]
> Sent: Wednesday, June 28, 2017 11:42 AM
> To: user <us...@drill.apache.org>
> Subject: Re: Drill Query Engine for nested json or parquet
>
> You are right that Q2 has to first apply flatten the field "details"
> before applying filter on top of the flattened rows.  I do not think Drill
> currently supports to push filter down.
>
> The difference between Q1 and Q2:  Q1 is querying a 4-row table, while Q2
> is querying a 1-row table; the flatten operator produces 4 rows on top of
> which the filter is applied.
>
>
> On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <Da...@blackrock.com>
> wrote:
>
> >
> > Is there a fundamental difference between the following queries? I
> > can't get the second example working with parquet files which contain
> > 400,000+ nested records..
> >
> > It seems like the system wants to flatten every possible record before
> > applying the SQL Where clause to the flattened data structure..
> >
> > Example 1:
> >
> > select b.* from dfs.`test1.json` b where b.item  = 3
> >
> > [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> > ]
> >
> > Example 2:
> >
> > select b.* from
> > (select flatten(a.details) as details
> > from dfs.`test2.json` a) b
> > where b.details.item  = 3
> >
> > {
> >   "header": "my_header_info",
> >   "details": [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> >   ]
> > }
> >
> >
> >
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender
> > immediately and delete this message. See http://www.blackrock.com/
> > corporate/en-us/compliance/email-disclaimers for further information.
> > Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> > privacy-policy for more information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see
> > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> >
> > © 2017 BlackRock, Inc. All rights reserved.
> >
>

RE: Drill Query Engine for nested json or parquet

Posted by "Lee, David" <Da...@blackrock.com>.

Ok thanks, but now I got a new odd problem which I haven't been able to debug..

SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))

After unwrapping the JSON records from a nested dictionary, 90% of my JSON files convert successfully to parquet, but 10% are getting the error above..

I've tried to narrow down which rows in my JSON file are problematic, but it is both consistent and random.. Removing 1000s records here and there fixes the file, but it must be a combination of multiple records causing the error. This error never showed up when I converted the wrapped JSON to parquet..

Any idea how to get to the root of this one?

-----Original Message-----
From: Jinfeng Ni [mailto:jni@apache.org] 
Sent: Wednesday, June 28, 2017 11:42 AM
To: user <us...@drill.apache.org>
Subject: Re: Drill Query Engine for nested json or parquet

You are right that Q2 has to first apply flatten the field "details" before applying filter on top of the flattened rows.  I do not think Drill currently supports to push filter down.

The difference between Q1 and Q2:  Q1 is querying a 4-row table, while Q2 is querying a 1-row table; the flatten operator produces 4 rows on top of which the filter is applied.

On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <Da...@blackrock.com> wrote:

>
> Is there a fundamental difference between the following queries? I 
> can't get the second example working with parquet files which contain 
> 400,000+ nested records..
>
> It seems like the system wants to flatten every possible record before 
> applying the SQL Where clause to the flattened data structure..
>
> Example 1:
>
> select b.* from dfs.`test1.json` b where b.item  = 3
>
> [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
> ]
>
> Example 2:
>
> select b.* from
> (select flatten(a.details) as details
> from dfs.`test2.json` a) b
> where b.details.item  = 3
>
> {
>   "header": "my_header_info",
>   "details": [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
>   ]
> }
>
>
>
>
> This message may contain information that is confidential or privileged.
> If you are not the intended recipient, please advise the sender 
> immediately and delete this message. See http://www.blackrock.com/ 
> corporate/en-us/compliance/email-disclaimers for further information.
> Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> privacy-policy for more information about BlackRock’s Privacy Policy.
>
> For a list of BlackRock's office addresses worldwide, see 
> http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
>
> © 2017 BlackRock, Inc. All rights reserved.
>

Re: Drill Query Engine for nested json or parquet

Posted by Jinfeng Ni <jn...@apache.org>.

You are right that Q2 has to first apply flatten the field "details" before
applying filter on top of the flattened rows.  I do not think Drill
currently supports to push filter down.

The difference between Q1 and Q2:  Q1 is querying a 4-row table, while Q2
is querying a 1-row table; the flatten operator produces 4 rows on top of
which the filter is applied.


On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <Da...@blackrock.com> wrote:

>
> Is there a fundamental difference between the following queries? I can't
> get the second example working with parquet files which contain 400,000+
> nested records..
>
> It seems like the system wants to flatten every possible record before
> applying the SQL Where clause to the flattened data structure..
>
> Example 1:
>
> select b.* from dfs.`test1.json` b where b.item  = 3
>
> [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
> ]
>
> Example 2:
>
> select b.* from
> (select flatten(a.details) as details
> from dfs.`test2.json` a) b
> where b.details.item  = 3
>
> {
>   "header": "my_header_info",
>   "details": [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
>   ]
> }
>
>
>
>
> This message may contain information that is confidential or privileged.
> If you are not the intended recipient, please advise the sender immediately
> and delete this message. See http://www.blackrock.com/
> corporate/en-us/compliance/email-disclaimers for further information.
> Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> privacy-policy for more information about BlackRock’s Privacy Policy.
>
> For a list of BlackRock's office addresses worldwide, see
> http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
>
> © 2017 BlackRock, Inc. All rights reserved.
>