Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/09/21 20:15:49 UTC

Troubleshooting JSON File read

I am reading a MongoDB dump file in Drill. On the surface it seems to be
working well; however, I have some need to troubleshoot, and I was curious
about the best way to approach it. Here are some "things":


1. It's a large file, 1.2 GB compressed. It's named mongodump.json.gz, and
Drill seems (on the surface) to be handling that correctly.
2. It's Drill 1.1 (MapR package).
3. select * from `/pathto/*` limit 10 seems to work; in this case the _id
field contains IP addresses (long story).
4. If I do select * from `/pathto/*` where `_id` = '123.123.123.123' (an
_id that was returned by the select * limit 10 query in #3), it finds the
record; all is well.
5. If I run select * from `/pathto/*` where `_id` = '127.0.0.1', an _id I
know to be in the data (validated with zgrep), it does NOT find the record.
Based on the zgrep results it should find it; I am not sure if there is
something weird in reading the data, but it's not throwing errors.
6. select count(*) from `/pathto/*` returns the same number as zcat
mongodump.json.gz | wc -l. This is interesting because it apparently means
the record counts line up, but why isn't that IP showing?

So my question is this: is there anything in Drill that would cause it to
miss that record? Weird characters, etc.? I know it's hard to say, but with
a 1.2 GB compressed file, how would one troubleshoot this?
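
In case it helps with reproducing this, here is roughly how I would dig
into the raw record (just a sketch, assuming one JSON object per line and
that _id is stored as a plain string like "_id": "127.0.0.1"):

  # pull the first raw line containing that IP and make non-printing
  # characters visible
  zgrep '127\.0\.0\.1' mongodump.json.gz | head -n 1 | cat -A

  # count how many records carry that exact _id value
  zgrep -c '"_id" *: *"127\.0\.0\.1"' mongodump.json.gz

If cat -A shows stray control characters, or _id turns out to be wrapped in
something other than a plain string, that would explain why the equality
predicate misses it.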

Re: Troubleshooting JSON File read

Posted by John Omernik <jo...@omernik.com>.
I didn't see a leading and trailing [] in the data; I think the issue was
one big file. Although when I split it, Drill became extremely slow in
processing the 25 gzipped JSON files, perhaps due to the 263 dynamic key
names. These JSON files ranged from 14 MB to 85 MB compressed. I'm not sure
why it didn't handle that situation better, unless it was just very unhappy
with all the keys.
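
For anyone who wants to sanity-check the key explosion on similar data, a
rough way to tally the distinct top-level keys (just a sketch, assuming one
JSON object per line and jq available) is:

  # count how often each top-level key appears across the split files
  zcat *.json.gz | jq -r 'keys[]' | sort | uniq -c | sort -rn | head -20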

John

Re: Troubleshooting JSON File read

Posted by Ted Dunning <te...@gmail.com>.
Consider just deleting the leading [ and trailing ]. If your objects are
each on a single line, you are good to go at that point.
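
A sketch of that fix (assuming the only brackets are a single leading [ on
the first line and a trailing ] on the last line; file names here are
illustrative):

  # strip the array brackets and rewrite the dump
  zcat mongodump.json.gz | sed '1 s/^\[//; $ s/\]$//' | gzip > mongodump-fixed.json.gz

If each line also ends with a comma separating the array elements, a
similar substitution (for example s/},$/}/) would probably be needed as
well so each line is a standalone JSON object.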



Re: Troubleshooting JSON File read

Posted by John Omernik <jo...@omernik.com>.
I think I found my issue (see below). I'd recommend that Drill include a
warning when querying such data, rather than failing open (continuing
silently) the way it did for me. Now I'm trying to figure out another issue
(the query takes forever on 25 smaller, 50-100 MB gzipped files)... I'll
keep posting here...

Lengthy JSON objects

Currently, Drill cannot manage lengthy JSON objects, such as a
gigabyte-sized JSON file. Finding the beginning and end of records can be
time-consuming and require scanning the whole file.

Workaround: Use a tool to split the JSON file into smaller chunks of
64-128MB or 64-256MB initially until you know the total data size and node
configuration. Keep the JSON objects intact in each file. A distributed
file system, such as HDFS, is recommended over trying to manage file
partitions.
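
For reference, one way to do that split without breaking records (a sketch,
assuming one JSON object per line, GNU coreutils, and illustrative file
names):

  # decompress, cut into ~128 MB chunks on line boundaries, and recompress
  # each chunk
  zcat mongodump.json.gz | split -C 128M -d - chunk_
  for f in chunk_*; do mv "$f" "$f.json" && gzip "$f.json"; done

Keeping the .json.gz naming should let Drill pick the chunks up the same
way it did the original file.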
