Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/11/03 20:07:45 UTC

Line Parsing Errors and Skipping

I am doing some "active" loading of data into JSON files on MapRFS.
Basically I have feeds pulling from a message queue and outputting the
JSON messages.

I have a query that does aggregations on all the data, and it seems to work
90% of the time.

The other 10% of the time, I get this error:

Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input in
VALUE_STRING

File: /path/to/file
Record: someint
Column: someint
Fragment someint:someint

(I replaced the actual record, column, and fragment info obviously)


When I get this error, I can run the same query again, and all is well.

My questions are these:

1. My "gut" is telling me this is because I have files being written in
real time to MapRFS using POSIX tools over NFS, and when the error occurs,
it's because the Python fh.write() is "in mid-stream" when Drill tries to
query the file, so the file isn't perfectly formatted JSON (see the sketch
after these questions). Does this seem feasible?

2. Just waiting a bit fixes things. Because of how Drill works, i.e. it has
to read all the data on an aggregate query, if the failure were caused by
corrupt data that was permanently written, it would fail every time. (I.e.
I shouldn't be troubleshooting this: if the query works on a retry, the
problem is resolved, at least until the next time I try to read a
half-written JSON object.) Is this accurate?

3. Is this always going to be the case with "realtime" data, or is there a
way to address this?

4. Is there a way to address this type of issue by skipping that
line/record? I know there was some talk about skipping records in other
posts/JIRAs, but I'm not sure whether this case would be covered there.

5. Am I completely off base and the actual problem is something else?
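
To illustrate what I suspect in question 1, here is a minimal sketch of how
a truncated record produces exactly this class of parse error (the record
content is made up):

    import json

    complete = '{"event": "login", "user": "alice"}'
    truncated = complete[:20]  # writer interrupted mid-record

    json.loads(complete)  # a fully written record parses fine
    try:
        json.loads(truncated)
    except json.JSONDecodeError as err:
        # Python reports e.g. "Unterminated string starting at ...";
        # Drill reports the same class of failure as
        # "Unexpected end-of-input in VALUE_STRING".
        print(err)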

John

Re: Line Parsing Errors and Skipping

Posted by John Omernik <jo...@omernik.com>.
Great feature, and this fixes my problem. All I do in my Python script is
open each file with a "." prefix, and when I "close" it I rename it without
the "." prefix. Easy fix. Thanks for the pointer, Andries!
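
For reference, a minimal sketch of the pattern (the path and records here
are placeholders, not my actual feed):

    import json
    import os

    def write_json_file(records, final_path):
        # Write under a "."-prefixed name so Drill 1.2+ ignores the
        # file while it is still being written.
        directory, name = os.path.split(final_path)
        hidden_path = os.path.join(directory, "." + name)
        with open(hidden_path, "w") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        # Rename is atomic on the same filesystem: Drill sees either
        # no file at all or a complete, well-formed one.
        os.rename(hidden_path, final_path)

    write_json_file([{"event": "login", "user": "alice"}],
                    "/mapr/cluster/data/feed/batch_0001.json")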

John

On Tue, Nov 3, 2015 at 1:52 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> See DRILL-2424 and DRILL-1131.
> Incomplete records/files can cause issues. In Drill 1.2 they have added
> the ability to ignore data files with a "." prefix.
>
> Perhaps copy files in over NFS using a "." prefix and then rename them
> once copied onto the DFS.
>
> I had the same issue with Flume data streaming in and incomplete records,
> but have not been able to test with Drill 1.2 yet. However, if I copy an
> existing file into the same directory with a "." prefix, I can see in the
> query plan that the hidden file is being ignored.
>
> —Andries
>
> > On Nov 3, 2015, at 11:07 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > [snip]
>

Re: Line Parsing Errors and Skipping

Posted by Andries Engelbrecht <ae...@maprtech.com>.
See DRILL-2424 and DRILL-1131.
Incomplete records/files can cause issues. In Drill 1.2 they have added the ability to ignore data files with a "." prefix.

Perhaps copy files in over NFS using a "." prefix and then rename them once copied onto the DFS.
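
A minimal sketch of that copy-then-rename pattern in Python (the paths are
placeholders):

    import os
    import shutil

    def copy_into_dfs(src, dest_dir):
        # Land the copy under a "." prefix so a partially copied file
        # stays hidden from Drill, then rename it into place.
        name = os.path.basename(src)
        hidden = os.path.join(dest_dir, "." + name)
        shutil.copy(src, hidden)
        os.rename(hidden, os.path.join(dest_dir, name))

    copy_into_dfs("/staging/events_0001.json", "/mapr/cluster/data/feed")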

I had the same issue with Flume data streaming in and incomplete records, but have not been able to test with Drill 1.2 yet. However, if I copy an existing file into the same directory with a "." prefix, I can see in the query plan that the hidden file is being ignored.

—Andries
 
> On Nov 3, 2015, at 11:07 AM, John Omernik <jo...@omernik.com> wrote:
>
> [snip]


Re: Line Parsing Errors and Skipping

Posted by John Omernik <jo...@omernik.com>.
Well, I have one program writing data via Python to MapRFS in a directory
that Drill is reading, so yes, I have two different programs reading and
writing the data. What I am looking for here: knowing I may have a scenario
where a read occurs before a write is complete, can I just have Drill
ignore that record?

John

On Tue, Nov 3, 2015 at 1:18 PM, mark charts <mc...@yahoo.com.invalid>
wrote:

> Hi.
> I read your dilemma. Would a trap in your program to handle this error or
> exception work for you in this case, so you could skip around the trouble?
> My guess is you have a timing condition gone astray somewhere, and you
> need to ensure all states are timed correctly.
> But what do I know. Good luck.
>
> Mark Charts
>
>
>      On Tuesday, November 3, 2015 2:08 PM, John Omernik <jo...@omernik.com>
> wrote:
>
> [snip]

Re: Line Parsing Errors and Skipping

Posted by mark charts <mc...@yahoo.com.INVALID>.
Hi.
I read your dilemma. Would a trap in your program to handle this error or exception work for you in this case, so you could skip around the trouble? My guess is you have a timing condition gone astray somewhere, and you need to ensure all states are timed correctly.
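
For example, a minimal sketch of such a trap, assuming a generic run_query
client call you supply (Drill surfaces these failures with "DATA_READ
ERROR" in the message):

    import time

    def run_with_retry(run_query, sql, retries=3, delay=5):
        # run_query is whatever client call submits SQL to Drill
        # (JDBC/ODBC/REST wrapper); it is assumed to raise an exception
        # whose message contains "DATA_READ ERROR" when a half-written
        # record trips the JSON parser.
        for attempt in range(retries):
            try:
                return run_query(sql)
            except Exception as err:
                if "DATA_READ ERROR" not in str(err) or attempt == retries - 1:
                    raise
                time.sleep(delay)  # let the writer finish the record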
But what do I know. Good luck.

Mark Charts 


     On Tuesday, November 3, 2015 2:08 PM, John Omernik <jo...@omernik.com> wrote:

[snip]