Posted to user@drill.apache.org by Alexander Reshetov <al...@gmail.com> on 2015/04/01 12:29:11 UTC
Report issues with sensitive data
Hello all,
I have an 80 GB dataset of JSON documents with many nested arrays.
I'm trying to flatten it and run some calculations, but I get
exceptions after reading about 2/3 of the file.
I could (and want to) post an issue in Jira, but I cannot attach my dataset
because it contains sensitive data and is also too large.
Is there any way to help investigate the issue without posting my dataset?
To give a hint about the issue, I've attached a file with the exception text.
Re: Report issues with sensitive data
Posted by Andries Engelbrecht <ae...@maprtech.com>.
Are you using 0.8, which was just released? I have found it to be much better at handling large JSON data sets.
Also, it is handy to use a predicate to filter out JSON docs when you want to use a map or array that is not present in all of the docs. Typically a null value is assigned to missing objects or arrays.
A simple WHERE a.b.c IS NOT NULL will filter out docs that don’t have the specific nested map
Or WHERE a.b.c[0] IS NOT NULL for arrays
Or WHERE a.b.c[0].d IS NOT NULL
This avoids the functions having to deal with NULL values when doing calculations, as the empty sets get filtered out.
While Drill is extremely powerful, it is always a good idea to apply some logic to avoid NULL values creeping in with complex data like JSON. Sometimes a simple CAST to the intended data type can also go a long way toward preventing Drill from inferring the data type from data that may be inconsistent.
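Putting those pieces together, a full query might look like the sketch below. The file path and the `id` and `price` fields are invented for illustration and are not from this thread:

```sql
-- Hypothetical sketch: the path dfs.`/data/docs.json` and the fields
-- id, a.b.c, and price are placeholders, not from the original thread.
SELECT t.id,
       CAST(t.elem.price AS DOUBLE) AS price   -- explicit type, avoids inference
FROM (
    SELECT d.id,
           FLATTEN(d.a.b.c) AS elem            -- expand the nested array
    FROM dfs.`/data/docs.json` d
    WHERE d.a.b.c IS NOT NULL                  -- skip docs missing the array
) t;
```

The inner WHERE clause applies the filter before FLATTEN runs, so the flatten and the outer calculations never see documents missing the nested array.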
—Andries
On Apr 1, 2015, at 3:29 AM, Alexander Reshetov <al...@gmail.com> wrote:
> Hello all,
>
> I have an 80 GB dataset of JSON documents with many nested arrays.
> I'm trying to flatten it and run some calculations, but I get
> exceptions after reading about 2/3 of the file.
>
> I could (and want to) post an issue in Jira, but I cannot attach my dataset
> because it contains sensitive data and is also too large.
>
> Is there any way to help investigate the issue without posting my dataset?
>
> To give a hint about the issue, I've attached a file with the exception text.
Re: Report issues with sensitive data
Posted by Alexander Reshetov <al...@gmail.com>.
Hi,
Andries, Ted, thanks for quick replies.
Yes, I'm using the latest official build of 0.8.
I did some investigation of possible causes and also found a way to
hide the sensitive data.
Please see the issue regarding this [1].
In the process I found one strange behavior which I assume led to this issue.
(if the dataset files are missing, they are still uploading)
[1] https://issues.apache.org/jira/browse/DRILL-2677
On Wed, Apr 1, 2015 at 7:46 PM, Ted Dunning <te...@gmail.com> wrote:
> One idea is to post a log-synth [1] schema that generates data the same
> shape as your real data. If you can generate fake data that causes the
> same problem, you give developers a huge head start in solving your problem.
>
> For the record, are you using the recently announced 0.8 version of Drill?
>
>
> [1] https://github.com/tdunning/log-synth
>
>
> On Wed, Apr 1, 2015 at 3:29 AM, Alexander Reshetov <
> alexander.v.reshetov@gmail.com> wrote:
>
>> Hello all,
>>
>> I have an 80 GB dataset of JSON documents with many nested arrays.
>> I'm trying to flatten it and run some calculations, but I get
>> exceptions after reading about 2/3 of the file.
>>
>> I could (and want to) post an issue in Jira, but I cannot attach my dataset
>> because it contains sensitive data and is also too large.
>>
>> Is there any way to help investigate the issue without posting my dataset?
>>
>> To give a hint about the issue, I've attached a file with the exception text.
>>
Re: Report issues with sensitive data
Posted by Ted Dunning <te...@gmail.com>.
One idea is to post a log-synth [1] schema that generates data the same
shape as your real data. If you can generate fake data that causes the
same problem, you give developers a huge head start in solving your problem.
For the record, are you using the recently announced 0.8 version of Drill?
[1] https://github.com/tdunning/log-synth
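For illustration, a minimal log-synth schema that generates nested JSON of roughly this shape might look like the sketch below. The field names are invented, and the exact sampler classes and options should be checked against the log-synth README:

```json
[
  {"name": "id", "class": "id"},
  {"name": "user", "class": "name"},
  {"name": "events", "class": "sequence", "length": 5,
   "base": {"class": "int", "min": 0, "max": 100}}
]
```

The idea is to mirror the structure of the real data (here, a nested array per document) without reproducing any of its sensitive values, so a failing query can be shared along with a generator for data that triggers the same failure.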
On Wed, Apr 1, 2015 at 3:29 AM, Alexander Reshetov <
alexander.v.reshetov@gmail.com> wrote:
> Hello all,
>
> I have an 80 GB dataset of JSON documents with many nested arrays.
> I'm trying to flatten it and run some calculations, but I get
> exceptions after reading about 2/3 of the file.
>
> I could (and want to) post an issue in Jira, but I cannot attach my dataset
> because it contains sensitive data and is also too large.
>
> Is there any way to help investigate the issue without posting my dataset?
>
> To give a hint about the issue, I've attached a file with the exception text.
>