Posted to user@drill.apache.org by Alexander Reshetov <al...@gmail.com> on 2015/04/01 12:29:11 UTC

Report issues with sensitive data

Hello all,

I have an 80GB dataset of JSON documents with many nested arrays.
I'm trying to flatten it and run some calculations, but I get
exceptions after reading about 2/3 of the file.

I could (and want) to post an issue in Jira, but I cannot attach my dataset
because it contains sensitive data and is also too large.

Is there any way to help investigate issues without posting my dataset?

To give a hint about the issue, I've attached a file with the exception text.

Re: Report issues with sensitive data

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Are you using 0.8, which was just released? I have found it to be much better at handling large JSON data sets.

It is also handy to use a predicate to filter out JSON docs when you want to use a map or array that is not present in all of the docs. Typically a null value is assigned to missing objects or arrays.

A simple WHERE a.b.c IS NOT NULL will filter out docs that don't have the specific nested map,
or WHERE a.b.c[0] IS NOT NULL for arrays,
or WHERE a.b.c[0].d IS NOT NULL for a field inside an array element.

This avoids the functions having to deal with NULL values when doing calculations, as the empty sets get filtered out.

While Drill is extremely powerful, it is always a good idea to apply some logic to avoid NULL values creeping in with complex data like JSON. Sometimes a simple CAST to an explicit data type can also go a long way toward keeping Drill from having to infer the data type from data that may be inconsistent.
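As a sketch of the advice above (the file path and the a.b.c.d field names are illustrative, not from an actual dataset), the predicate and the cast can be combined with FLATTEN like this:

```sql
SELECT CAST(x.c_item.d AS INTEGER) AS d_val   -- explicit type, no inference
FROM (
  SELECT FLATTEN(t.a.b.c) AS c_item           -- one output row per array element
  FROM dfs.`/data/events.json` t
  WHERE t.a.b.c[0] IS NOT NULL                -- drop docs missing the array
) x
WHERE x.c_item.d IS NOT NULL;                 -- drop elements missing the field
```

Filtering in the inner query keeps FLATTEN from ever seeing the docs where the array is absent.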

—Andries




Re: Report issues with sensitive data

Posted by Alexander Reshetov <al...@gmail.com>.
Hi,

Andries, Ted, thanks for quick replies.
Yes, I'm using the latest official build of 0.8.

I did some investigation of the possible causes and also found a way to
hide the sensitive data.
Please see the issue I filed about this [1].

In the process I found one strange behavior which I assume leads to this issue.
(If the dataset files are missing from the issue, they are still uploading.)

[1] https://issues.apache.org/jira/browse/DRILL-2677
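For anyone facing the same problem, one shape-preserving way to hide sensitive data (a sketch, not necessarily the method used in DRILL-2677; the sample record is made up) is to walk the JSON and replace every leaf value with a dummy of the same type:

```python
import json

def mask(value):
    """Recursively replace leaf values with type-preserving dummies,
    keeping the nesting structure of dicts and lists intact."""
    if isinstance(value, dict):
        return {k: mask(v) for k, v in value.items()}
    if isinstance(value, list):
        return [mask(v) for v in value]
    if isinstance(value, bool):      # check bool before int: bool is an int subclass
        return False
    if isinstance(value, int):
        return 0
    if isinstance(value, float):
        return 0.0
    if isinstance(value, str):
        return "x" * len(value)      # same length, no content
    return value                     # None stays None

record = {"user": "alice", "events": [{"ts": 1427887751, "tags": ["a", "b"]}]}
print(json.dumps(mask(record)))
```

The masked output keeps the exact same keys, nesting depth, and array lengths, which is usually what matters for reproducing a flatten bug.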




Re: Report issues with sensitive data

Posted by Ted Dunning <te...@gmail.com>.
One idea is to post a log-synth [1] schema that generates data with the same
shape as your real data.  If you can generate fake data that triggers the
same problem, you give developers a huge head start on solving it.

For the record, are you using the recently announced 0.8 version of Drill?


[1] https://github.com/tdunning/log-synth
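Where a generator like log-synth doesn't fit, even a small script that emits nested JSON of the same shape can reproduce shape-related bugs; a minimal sketch (the field names and size ranges are made up for illustration, not taken from the reporter's data):

```python
import json
import random

def fake_record(rng):
    """Emit one record with nested arrays of varying length, mimicking
    the kind of structure that trips up flattening -- not the real data."""
    return {
        "id": rng.randrange(10**6),
        "items": [
            {"score": rng.random(),
             "tags": ["t%d" % rng.randrange(100) for _ in range(rng.randrange(5))]}
            for _ in range(rng.randrange(4))
        ],
    }

# Fixed seed so the "fake" dataset is reproducible across runs.
rng = random.Random(42)
with open("fake.json", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(fake_record(rng)) + "\n")
```

Because arrays of length zero are generated too, the output exercises the same missing-element cases a predicate like WHERE a.b.c[0] IS NOT NULL is meant to guard against.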

