Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2016/05/23 14:48:32 UTC

Reading and converting Parquet files intended for Impala

I have a largish directory of parquet files generated for use in Impala.
They were created with the CDH version of apache-parquet-mr (not sure on
version at this time)

Some settings:
Compression: snappy
Use Dictionary: true
WRITER_VERSION: PARQUET_1_0

I can read them as-is in Drill; however, the strings all come through as
binary (see other thread). I can cast all those fields to VARCHAR and read
them, but I take a bad performance hit (2 seconds to read directly from the
raw parquet with LIMIT 10, but showing binary; 25 seconds to use a view that
CASTs all fields into the proper types... the data returns accurately, but
10 rows taking 25 seconds is too long).
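
For reference, the view is roughly of this shape (view and column names here
are placeholders, not my real schema):

CREATE OR REPLACE VIEW dfs.tmp.events_view AS
SELECT
  CAST(string_col AS VARCHAR) AS string_col,  -- strings otherwise come back as binary
  bigint_col
FROM dfs.`/path/to/files`;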

So I want to read from this directory (approx. 126 GB) and CTAS it into a
form that Drill will be happier with.

I've tried this two ways. One was just to CTAS directly from the view I
created, with everything else left at defaults. The other was to set the
reader option "new_reader" = true. Neither worked, and new_reader actually
behaves very badly (I need to restart the drillbits); at least the other,
default reader errors :)
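
Concretely, the CTAS is shaped like this (table and view names are
placeholders), run once for each reader setting below:

CREATE TABLE dfs.tmp.events_drill AS
SELECT * FROM dfs.tmp.events_view;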

store.parquet.use_new_reader = false (the default)
This threw the error below (it's a truncated error; lots of field names and
other things omitted). It stored 6 GB of files and died.

store.parquet.use_new_reader = true
1.4 GB of files were created and then everything hangs; I need to restart
the drillbits (is this an issue?)



Error from "non" new_reader:

Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014

Fragment 1:36

[Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
atl1ctuzeta05.ctu-bo.secureworks.net:20001]

  (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
parquet record reader.
Message:
Hadoop path: /path/to/files/-m-00001.snappy.parquet
Total records read: 393120
Mock records read: 0
Records to read: 32768
Row group index: 0
Records in row group: 536499
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
…

    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745

  Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
    org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
    org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
    org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
So the reader being used by default in Drill 1.6 is the reader from the
standard parquet library then? (Not the special Drill reader?)

On Sat, May 28, 2016 at 7:22 AM, Abdel Hakim Deneche <ad...@maprtech.com>
wrote:

> the new parquet reader, the complex reader, is disabled by default. You can
> enable it by setting the following option to true:
>
> store.parquet.use_new_reader
>
>
>
> On Sat, May 28, 2016 at 4:56 AM, John Omernik <jo...@omernik.com> wrote:
>
> > I remember reading that drill uses two readers. One for certain cases ( I
> > think flat structures) and the other for complex structures.  A. Am I
> > remembering correctly? B. If so, can I determine via the plan or
> something
> > which is being used? And C. Can I force Drill to try the other reader?
> >
> > On Saturday, May 28, 2016, Ted Dunning <te...@gmail.com> wrote:
> >
> > > The Parquet user/dev mailing list might be helpful here. They have a
> real
> > > stake in making sure that all readers/writers can work together. The
> > > problem here really does sound like there is a borderline case that
> isn't
> > > handled as well in the Drill special purpose parquet reader as in the
> > > normal readers.
> > >
> > >
> > >
> > >
> > >
> > > On Fri, May 27, 2016 at 7:23 PM, John Omernik <john@omernik.com
> > > <javascript:;>> wrote:
> > >
> > > > So working with MapR support we tried that with Impala, but it didn't
> > > > produce the desired results because the outputfile worked fine in
> > Drill.
> > > > Theory: Evil file is created in Mapr Reduce, and is using a different
> > > > writer than Impala is using. Impala can read the evil file, but when
> it
> > > > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
> > > Drill
> > > > can't read evil file, but if we try to reduce with Impala, files is
> no
> > > > longer evil, consider it... chaotic neutral ... (For all you D&D
> fans )
> > > >
> > > > I'd ideally love to extract into badness, but on the phone now with
> > MapR
> > > > support to figure out HOW, that is the question at hand.
> > > >
> > > > John
> > > >
> > > > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <ted.dunning@gmail.com
> > > <javascript:;>>
> > > > wrote:
> > > >
> > > > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <john@omernik.com
> > > <javascript:;>> wrote:
> > > > >
> > > > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > > > remember,
> > > > > > Impala queries this file just fine) created in Map Reduce, with a
> > > > column
> > > > > > causing Array Index Out of Bounds problems with a BIGINT typed
> > > column.
> > > > > What
> > > > > > would your next steps be to troubleshoot?
> > > > > >
> > > > >
> > > > > I would start reducing the size of the evil file.
> > > > >
> > > > > If you have a tool that can query the bad parquet and write a new
> one
> > > > > (sounds like Impala might do here) then selecting just the evil
> > column
> > > > is a
> > > > > good first step.
> > > > >
> > > > > After that, I would start bisecting to find a small range that
> still
> > > > causes
> > > > > the problem. There may not be such, but it is good thing to try.
> > > > >
> > > > > At that point, you could easily have the problem down to a few
> > > kilobytes
> > > > of
> > > > > data that can be used in a unit test.
> > > > >
> > > >
> > >
> >
> >
> > --
> > Sent from my iThing
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>

Re: Reading and converting Parquet files intended for Impala

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
the new parquet reader, the complex reader, is disabled by default. You can
enable it by setting the following option to true:

store.parquet.use_new_reader
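
For example, for the current session (or use ALTER SYSTEM to change it for
all sessions):

ALTER SESSION SET `store.parquet.use_new_reader` = true;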



On Sat, May 28, 2016 at 4:56 AM, John Omernik <jo...@omernik.com> wrote:

> I remember reading that drill uses two readers. One for certain cases ( I
> think flat structures) and the other for complex structures.  A. Am I
> remembering correctly? B. If so, can I determine via the plan or something
> which is being used? And C. Can I force Drill to try the other reader?
>
> On Saturday, May 28, 2016, Ted Dunning <te...@gmail.com> wrote:
>
> > The Parquet user/dev mailing list might be helpful here. They have a real
> > stake in making sure that all readers/writers can work together. The
> > problem here really does sound like there is a borderline case that isn't
> > handled as well in the Drill special purpose parquet reader as in the
> > normal readers.
> >
> >
> >
> >
> >
> > On Fri, May 27, 2016 at 7:23 PM, John Omernik <john@omernik.com
> > <javascript:;>> wrote:
> >
> > > So working with MapR support we tried that with Impala, but it didn't
> > > produce the desired results because the outputfile worked fine in
> Drill.
> > > Theory: Evil file is created in Mapr Reduce, and is using a different
> > > writer than Impala is using. Impala can read the evil file, but when it
> > > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
> > Drill
> > > can't read evil file, but if we try to reduce with Impala, files is no
> > > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
> > >
> > > I'd ideally love to extract into badness, but on the phone now with
> MapR
> > > support to figure out HOW, that is the question at hand.
> > >
> > > John
> > >
> > > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <ted.dunning@gmail.com
> > <javascript:;>>
> > > wrote:
> > >
> > > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <john@omernik.com
> > <javascript:;>> wrote:
> > > >
> > > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > > remember,
> > > > > Impala queries this file just fine) created in Map Reduce, with a
> > > column
> > > > > causing Array Index Out of Bounds problems with a BIGINT typed
> > column.
> > > > What
> > > > > would your next steps be to troubleshoot?
> > > > >
> > > >
> > > > I would start reducing the size of the evil file.
> > > >
> > > > If you have a tool that can query the bad parquet and write a new one
> > > > (sounds like Impala might do here) then selecting just the evil
> column
> > > is a
> > > > good first step.
> > > >
> > > > After that, I would start bisecting to find a small range that still
> > > causes
> > > > the problem. There may not be such, but it is good thing to try.
> > > >
> > > > At that point, you could easily have the problem down to a few
> > kilobytes
> > > of
> > > > data that can be used in a unit test.
> > > >
> > >
> >
>
>
> --
> Sent from my iThing
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
I remember reading that drill uses two readers. One for certain cases ( I
think flat structures) and the other for complex structures.  A. Am I
remembering correctly? B. If so, can I determine via the plan or something
which is being used? And C. Can I force Drill to try the other reader?
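
(For B, to be concrete, I mean looking at something like the output of the
following, though I don't know whether the reader choice actually shows up
there; the path is just a placeholder:)

EXPLAIN PLAN FOR
SELECT * FROM dfs.`/path/to/files` LIMIT 10;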

On Saturday, May 28, 2016, Ted Dunning <te...@gmail.com> wrote:

> The Parquet user/dev mailing list might be helpful here. They have a real
> stake in making sure that all readers/writers can work together. The
> problem here really does sound like there is a borderline case that isn't
> handled as well in the Drill special purpose parquet reader as in the
> normal readers.
>
>
>
>
>
> On Fri, May 27, 2016 at 7:23 PM, John Omernik <john@omernik.com
> <javascript:;>> wrote:
>
> > So working with MapR support we tried that with Impala, but it didn't
> > produce the desired results because the outputfile worked fine in Drill.
> > Theory: Evil file is created in Mapr Reduce, and is using a different
> > writer than Impala is using. Impala can read the evil file, but when it
> > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
> Drill
> > can't read evil file, but if we try to reduce with Impala, files is no
> > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
> >
> > I'd ideally love to extract into badness, but on the phone now with MapR
> > support to figure out HOW, that is the question at hand.
> >
> > John
> >
> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <ted.dunning@gmail.com
> <javascript:;>>
> > wrote:
> >
> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <john@omernik.com
> <javascript:;>> wrote:
> > >
> > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > remember,
> > > > Impala queries this file just fine) created in Map Reduce, with a
> > column
> > > > causing Array Index Out of Bounds problems with a BIGINT typed
> column.
> > > What
> > > > would your next steps be to troubleshoot?
> > > >
> > >
> > > I would start reducing the size of the evil file.
> > >
> > > If you have a tool that can query the bad parquet and write a new one
> > > (sounds like Impala might do here) then selecting just the evil column
> > is a
> > > good first step.
> > >
> > > After that, I would start bisecting to find a small range that still
> > causes
> > > the problem. There may not be such, but it is good thing to try.
> > >
> > > At that point, you could easily have the problem down to a few
> kilobytes
> > of
> > > data that can be used in a unit test.
> > >
> >
>


-- 
Sent from my iThing

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
So on this subject, I believe
https://issues.apache.org/jira/browse/DRILL-4464 may be related. While the
error messages are slightly different with tweaking of settings, I can
reproduce my problem with the test data that's included on the JIRA. I do
believe my problem is reproducible with this issue, and I posted the
similarities to the JIRA.

Thanks!

John

On Mon, May 30, 2016 at 7:06 PM, John Omernik <jo...@omernik.com> wrote:

> what I don't understand is the substitution in general. Why have
>  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-
> PATH>"/"-Xloggc:${loggc}"}
>
> instead of
>
> export SERVER_GC_OPTS="${SERVER_GC_OPTS/} -Xloggc:${loggc}"
>
> The latter seems much more straight forward and understandable, and less
> prone to odd ball issues. Maybe, one other if test to ensure that ${loggc}
> is set as well.
>
>  if [ -n "$SERVER_GC_OPTS" ] && [ -n "${loggc}" ]; then
> export SERVER_GC_OPTS="${SERVER_GC_OPTS/} -Xloggc:${loggc}"
> fi
>
> I guess I am just a big fan of simplification...
>
> On Mon, May 30, 2016 at 5:01 PM, Paul Rogers <pr...@maprtech.com> wrote:
>
>> Hi John,
>>
>> The Drill scripts need quite a bit of TLC. (See DRILL-4581.)
>> drill-config.sh tries to set up both the Drillbit (server) and sqlline
>> (client). Work was needed to fully separate the two. The CLIENT_GC_OPTS are
>> only for sqlline, SERVER_GC_OPTS are for the drillbit.
>>
>> The problem is that SERVER_GC_OPTS does two things that conflict. If it
>> only did logging, it would work:
>>
>> $ loggc=/foo/bar.log
>> $ export SERVER_GC_OPTS="-Xloggc:<FILE-PATH>"
>> $ echo ${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
>> -Xloggc:/foo/bar.log
>>
>> But, current version of drill-env.sh helpfully adds other stuff to
>> SERVER_GC_OPTS, which makes the substitution fail:
>>
>> export SERVER_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseG1GC "
>>
>> Sigh… More bugs to fix… I’ve added this issue as a comment to DRILL-4581.
>>
>> For now, just work around the problem using DRILL_JAVA_OPTS. The
>> following exists today in drill-env.sh:
>>
>> export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP…
>>
>> Add another line:
>>
>> export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Xloggc:/path/to/gc.log"
>>
>> You’ll have to specify the log path, but it sounds like you do that
>> anyway for your Mesos setup.
>>
>> By the way, another change we’re making for DoY is to split drill-env.sh
>> into three parts: Drill defaults move into drill-config.sh,
>> distribution-specific stuff moves into its own file, and drill-env.sh will
>> contain only site-specific settings.
>>
>> - Paul
>>
>> > On May 30, 2016, at 5:28 AM, John Omernik <jo...@omernik.com> wrote:
>> >
>> > More importantly, I am not sure how the strings inside the curly braces
>> > actually works either, based on testing... (echoing out the
>> SERVER_GC_OPTS
>> > and CLIENT_GC_OPTS) It's not actually working
>> >
>> > If I am reading the bash correctly, than it's looking to, if
>> SERVER_GC_OPTS
>> > (or CLIENT) is set (-n = return true if the length of the string is
>> > nonzero, since the Variable is interpreted, we are checking wether
>> there is
>> > something in the variable)  then we should be adding the xloggc (both of
>> > them) to the SERVER_GC_OPTS (and client).
>> >
>> > As you can see with the testing, the SERVER_GC_OPTS is only the value
>> that
>> > I am setting from my drill-env.sh (default setting)  which is loaded by
>> > drill-config.sh sourced earlier in the drillbit.sh.  Thus, this code in
>> > drillbit.sh is effectively doing nothing ... I guess my thought process
>> > here would be to have someone help decide what is intended here, (I am
>> not
>> > sure "Nothing" is intended based on the amount of code) and then we can
>> do
>> > some updating here to clarify and ensure efficacy.
>> >
>> >
>> > Testing:
>> >
>> > if [ -n "$SERVER_GC_OPTS" ]; then
>> >
>> >  export
>> SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
>> >
>> > fi
>> >
>> > if [ -n "$CLIENT_GC_OPTS" ]; then
>> >
>> >  export
>> CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
>> >
>> > fi
>> >
>> > echo "Server: $SERVER_GC_OPTS"
>> >
>> > echo "Client: $CLIENT_GC_OPTS"
>> >
>> > exit 1
>> >
>> >
>> > Server: -XX:+CMSClassUnloadingEnabled -XX:+UseG1GC
>> >
>> > Client:
>> >
>> > On Mon, May 30, 2016 at 6:43 AM, John Omernik <jo...@omernik.com> wrote:
>> >
>> >> So based on Paul's drilbit.sh comment and this, I decided to go ensure
>> I
>> >> was enabling the proper GC logging because I am skipping the
>> drillbit.sh.
>> >> I looked at the drillbit.sh, and frankly, It looks like a goofy error
>> may
>> >> be in that... the <FILE-PATH> seems to be in documentation for other
>> >> hadoop-ish projects, but I don't think Java or BASH does anything with
>> it.
>> >> Thus having that in the drillbit.sh (which to me shouldn't be changed)
>> >> seems to be a mistake... (Yes the -Xloggc after it may just overwrite
>> what
>> >> was passed in the <FILE-PATH> but am I correct in saying that this is
>> >> actually just a mistake that in the drillbit.sh, and all it does is add
>> >> confusion? I hope I am wrong here and I get to learn something, but I
>> ust
>> >> can't see how <FILE-PATH> is interpreted by bash or java....
>> >>
>> >>
>> >> John
>> >>
>> >>
>> >>
>> >> if [ -n "$SERVER_GC_OPTS" ]; then
>> >>
>> >>  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/
>> >> "-Xloggc:${loggc}"}
>> >>
>> >> fi
>> >>
>> >> if [ -n "$CLIENT_GC_OPTS" ]; then
>> >>
>> >>  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/
>> >> "-Xloggc:${loggc}"}
>> >>
>> >> fi
>> >>
>> >> On Mon, May 30, 2016 at 3:42 AM, Ted Dunning <te...@gmail.com>
>> >> wrote:
>> >>
>> >>> On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com>
>> wrote:
>> >>>
>> >>>> (It's a very weird situation that the bits get into,
>> >>>> everything hangs, somethings work, other things seem to be a in
>> >>>> an in-between between working and not working etc.  Like describe
>> table
>> >>>> operations eventually return but after 10+seconds.  I resolve this by
>> >>>> restarting all bits, and then things are right as rain.
>> >>>>
>> >>>
>> >>> Sounds like GC pressure, possibly.
>> >>>
>> >>> The GC logging that was mentioned in connection with drill.sh would be
>> >>> helpful here.
>> >>>
>> >>
>> >>
>>
>>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
what I don't understand is the substitution in general. Why have
 export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-
PATH>"/"-Xloggc:${loggc}"}

instead of

export SERVER_GC_OPTS="${SERVER_GC_OPTS/} -Xloggc:${loggc}"

The latter seems much more straight forward and understandable, and less
prone to odd ball issues. Maybe, one other if test to ensure that ${loggc}
is set as well.

 if [ -n "$SERVER_GC_OPTS" ] && [ -n "${loggc}" ]; then
export SERVER_GC_OPTS="${SERVER_GC_OPTS/} -Xloggc:${loggc}"
fi

I guess I am just a big fan of simplification...

On Mon, May 30, 2016 at 5:01 PM, Paul Rogers <pr...@maprtech.com> wrote:

> Hi John,
>
> The Drill scripts need quite a bit of TLC. (See DRILL-4581.)
> drill-config.sh tries to set up both the Drillbit (server) and sqlline
> (client). Work was needed to fully separate the two. The CLIENT_GC_OPTS are
> only for sqlline, SERVER_GC_OPTS are for the drillbit.
>
> The problem is that SERVER_GC_OPTS does two things that conflict. If it
> only did logging, it would work:
>
> $ loggc=/foo/bar.log
> $ export SERVER_GC_OPTS="-Xloggc:<FILE-PATH>"
> $ echo ${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
> -Xloggc:/foo/bar.log
>
> But, current version of drill-env.sh helpfully adds other stuff to
> SERVER_GC_OPTS, which makes the substitution fail:
>
> export SERVER_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseG1GC "
>
> Sigh… More bugs to fix… I’ve added this issue as a comment to DRILL-4581.
>
> For now, just work around the problem using DRILL_JAVA_OPTS. The following
> exists today in drill-env.sh:
>
> export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP…
>
> Add another line:
>
> export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Xloggc:/path/to/gc.log"
>
> You’ll have to specify the log path, but it sounds like you do that anyway
> for your Mesos setup.
>
> By the way, another change we’re making for DoY is to split drill-env.sh
> into three parts: Drill defaults move into drill-config.sh,
> distribution-specific stuff moves into its own file, and drill-env.sh will
> contain only site-specific settings.
>
> - Paul
>
> > On May 30, 2016, at 5:28 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > More importantly, I am not sure how the strings inside the curly braces
> > actually works either, based on testing... (echoing out the
> SERVER_GC_OPTS
> > and CLIENT_GC_OPTS) It's not actually working
> >
> > If I am reading the bash correctly, than it's looking to, if
> SERVER_GC_OPTS
> > (or CLIENT) is set (-n = return true if the length of the string is
> > nonzero, since the Variable is interpreted, we are checking wether there
> is
> > something in the variable)  then we should be adding the xloggc (both of
> > them) to the SERVER_GC_OPTS (and client).
> >
> > As you can see with the testing, the SERVER_GC_OPTS is only the value
> that
> > I am setting from my drill-env.sh (default setting)  which is loaded by
> > drill-config.sh sourced earlier in the drillbit.sh.  Thus, this code in
> > drillbit.sh is effectively doing nothing ... I guess my thought process
> > here would be to have someone help decide what is intended here, (I am
> not
> > sure "Nothing" is intended based on the amount of code) and then we can
> do
> > some updating here to clarify and ensure efficacy.
> >
> >
> > Testing:
> >
> > if [ -n "$SERVER_GC_OPTS" ]; then
> >
> >  export
> SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
> >
> > fi
> >
> > if [ -n "$CLIENT_GC_OPTS" ]; then
> >
> >  export
> CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
> >
> > fi
> >
> > echo "Server: $SERVER_GC_OPTS"
> >
> > echo "Client: $CLIENT_GC_OPTS"
> >
> > exit 1
> >
> >
> > Server: -XX:+CMSClassUnloadingEnabled -XX:+UseG1GC
> >
> > Client:
> >
> > On Mon, May 30, 2016 at 6:43 AM, John Omernik <jo...@omernik.com> wrote:
> >
> >> So based on Paul's drilbit.sh comment and this, I decided to go ensure I
> >> was enabling the proper GC logging because I am skipping the
> drillbit.sh.
> >> I looked at the drillbit.sh, and frankly, It looks like a goofy error
> may
> >> be in that... the <FILE-PATH> seems to be in documentation for other
> >> hadoop-ish projects, but I don't think Java or BASH does anything with
> it.
> >> Thus having that in the drillbit.sh (which to me shouldn't be changed)
> >> seems to be a mistake... (Yes the -Xloggc after it may just overwrite
> what
> >> was passed in the <FILE-PATH> but am I correct in saying that this is
> >> actually just a mistake that in the drillbit.sh, and all it does is add
> >> confusion? I hope I am wrong here and I get to learn something, but I
> ust
> >> can't see how <FILE-PATH> is interpreted by bash or java....
> >>
> >>
> >> John
> >>
> >>
> >>
> >> if [ -n "$SERVER_GC_OPTS" ]; then
> >>
> >>  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/
> >> "-Xloggc:${loggc}"}
> >>
> >> fi
> >>
> >> if [ -n "$CLIENT_GC_OPTS" ]; then
> >>
> >>  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/
> >> "-Xloggc:${loggc}"}
> >>
> >> fi
> >>
> >> On Mon, May 30, 2016 at 3:42 AM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >>
> >>> On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com>
> wrote:
> >>>
> >>>> (It's a very weird situation that the bits get into,
> >>>> everything hangs, somethings work, other things seem to be a in
> >>>> an in-between between working and not working etc.  Like describe
> table
> >>>> operations eventually return but after 10+seconds.  I resolve this by
> >>>> restarting all bits, and then things are right as rain.
> >>>>
> >>>
> >>> Sounds like GC pressure, possibly.
> >>>
> >>> The GC logging that was mentioned in connection with drill.sh would be
> >>> helpful here.
> >>>
> >>
> >>
>
>

Re: Reading and converting Parquet files intended for Impala

Posted by Paul Rogers <pr...@maprtech.com>.
Hi John,

The Drill scripts need quite a bit of TLC. (See DRILL-4581.) drill-config.sh tries to set up both the Drillbit (server) and sqlline (client). Work was needed to fully separate the two. The CLIENT_GC_OPTS are only for sqlline, SERVER_GC_OPTS are for the drillbit.

The problem is that SERVER_GC_OPTS does two things that conflict. If it only did logging, it would work:

$ loggc=/foo/bar.log
$ export SERVER_GC_OPTS="-Xloggc:<FILE-PATH>"
$ echo ${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
-Xloggc:/foo/bar.log

But, current version of drill-env.sh helpfully adds other stuff to SERVER_GC_OPTS, which makes the substitution fail:

export SERVER_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseG1GC "

Sigh… More bugs to fix… I’ve added this issue as a comment to DRILL-4581.

For now, just work around the problem using DRILL_JAVA_OPTS. The following exists today in drill-env.sh:

export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP…

Add another line:

export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Xloggc:/path/to/gc.log"

You’ll have to specify the log path, but it sounds like you do that anyway for your Mesos setup.

By the way, another change we’re making for DoY is to split drill-env.sh into three parts: Drill defaults move into drill-config.sh, distribution-specific stuff moves into its own file, and drill-env.sh will contain only site-specific settings.

- Paul

> On May 30, 2016, at 5:28 AM, John Omernik <jo...@omernik.com> wrote:
> 
> More importantly, I am not sure how the strings inside the curly braces
> actually works either, based on testing... (echoing out the SERVER_GC_OPTS
> and CLIENT_GC_OPTS) It's not actually working
> 
> If I am reading the bash correctly, than it's looking to, if SERVER_GC_OPTS
> (or CLIENT) is set (-n = return true if the length of the string is
> nonzero, since the Variable is interpreted, we are checking wether there is
> something in the variable)  then we should be adding the xloggc (both of
> them) to the SERVER_GC_OPTS (and client).
> 
> As you can see with the testing, the SERVER_GC_OPTS is only the value that
> I am setting from my drill-env.sh (default setting)  which is loaded by
> drill-config.sh sourced earlier in the drillbit.sh.  Thus, this code in
> drillbit.sh is effectively doing nothing ... I guess my thought process
> here would be to have someone help decide what is intended here, (I am not
> sure "Nothing" is intended based on the amount of code) and then we can do
> some updating here to clarify and ensure efficacy.
> 
> 
> Testing:
> 
> if [ -n "$SERVER_GC_OPTS" ]; then
> 
>  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
> 
> fi
> 
> if [ -n "$CLIENT_GC_OPTS" ]; then
> 
>  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}
> 
> fi
> 
> echo "Server: $SERVER_GC_OPTS"
> 
> echo "Client: $CLIENT_GC_OPTS"
> 
> exit 1
> 
> 
> Server: -XX:+CMSClassUnloadingEnabled -XX:+UseG1GC
> 
> Client:
> 
> On Mon, May 30, 2016 at 6:43 AM, John Omernik <jo...@omernik.com> wrote:
> 
>> So based on Paul's drilbit.sh comment and this, I decided to go ensure I
>> was enabling the proper GC logging because I am skipping the drillbit.sh.
>> I looked at the drillbit.sh, and frankly, It looks like a goofy error may
>> be in that... the <FILE-PATH> seems to be in documentation for other
>> hadoop-ish projects, but I don't think Java or BASH does anything with it.
>> Thus having that in the drillbit.sh (which to me shouldn't be changed)
>> seems to be a mistake... (Yes the -Xloggc after it may just overwrite what
>> was passed in the <FILE-PATH> but am I correct in saying that this is
>> actually just a mistake that in the drillbit.sh, and all it does is add
>> confusion? I hope I am wrong here and I get to learn something, but I ust
>> can't see how <FILE-PATH> is interpreted by bash or java....
>> 
>> 
>> John
>> 
>> 
>> 
>> if [ -n "$SERVER_GC_OPTS" ]; then
>> 
>>  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/
>> "-Xloggc:${loggc}"}
>> 
>> fi
>> 
>> if [ -n "$CLIENT_GC_OPTS" ]; then
>> 
>>  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/
>> "-Xloggc:${loggc}"}
>> 
>> fi
>> 
>> On Mon, May 30, 2016 at 3:42 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>> 
>>> On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com> wrote:
>>> 
>>>> (It's a very weird situation that the bits get into,
>>>> everything hangs, somethings work, other things seem to be a in
>>>> an in-between between working and not working etc.  Like describe table
>>>> operations eventually return but after 10+seconds.  I resolve this by
>>>> restarting all bits, and then things are right as rain.
>>>> 
>>> 
>>> Sounds like GC pressure, possibly.
>>> 
>>> The GC logging that was mentioned in connection with drill.sh would be
>>> helpful here.
>>> 
>> 
>> 


Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
More importantly, I am not sure how the strings inside the curly braces
actually work either, based on testing... (echoing out the SERVER_GC_OPTS
and CLIENT_GC_OPTS) it's not actually working.

If I am reading the bash correctly, then it's looking to: if SERVER_GC_OPTS
(or CLIENT) is set (-n returns true if the length of the string is nonzero,
and since the variable is interpolated, we are checking whether there is
something in the variable), then we should be adding the -Xloggc (both of
them) to SERVER_GC_OPTS (and CLIENT).

As you can see from the testing, SERVER_GC_OPTS is only the value that I am
setting from my drill-env.sh (default setting), which is loaded by
drill-config.sh, sourced earlier in drillbit.sh. Thus, this code in
drillbit.sh is effectively doing nothing... I guess my thought process here
would be to have someone help decide what is intended here (I am not sure
"nothing" is intended, based on the amount of code), and then we can do
some updating to clarify and ensure efficacy.


Testing:

if [ -n "$SERVER_GC_OPTS" ]; then

  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}

fi

if [ -n "$CLIENT_GC_OPTS" ]; then

  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/"-Xloggc:${loggc}"}

fi

echo "Server: $SERVER_GC_OPTS"

echo "Client: $CLIENT_GC_OPTS"

exit 1


Server: -XX:+CMSClassUnloadingEnabled -XX:+UseG1GC

Client:

On Mon, May 30, 2016 at 6:43 AM, John Omernik <jo...@omernik.com> wrote:

> So based on Paul's drilbit.sh comment and this, I decided to go ensure I
> was enabling the proper GC logging because I am skipping the drillbit.sh.
> I looked at the drillbit.sh, and frankly, It looks like a goofy error may
> be in that... the <FILE-PATH> seems to be in documentation for other
> hadoop-ish projects, but I don't think Java or BASH does anything with it.
> Thus having that in the drillbit.sh (which to me shouldn't be changed)
> seems to be a mistake... (Yes the -Xloggc after it may just overwrite what
> was passed in the <FILE-PATH> but am I correct in saying that this is
> actually just a mistake that in the drillbit.sh, and all it does is add
> confusion? I hope I am wrong here and I get to learn something, but I ust
> can't see how <FILE-PATH> is interpreted by bash or java....
>
>
> John
>
>
>
> if [ -n "$SERVER_GC_OPTS" ]; then
>
>   export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/
> "-Xloggc:${loggc}"}
>
> fi
>
> if [ -n "$CLIENT_GC_OPTS" ]; then
>
>   export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/
> "-Xloggc:${loggc}"}
>
> fi
>
> On Mon, May 30, 2016 at 3:42 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
>> On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com> wrote:
>>
>> > (It's a very weird situation that the bits get into,
>> > everything hangs, somethings work, other things seem to be a in
>> > an in-between between working and not working etc.  Like describe table
>> > operations eventually return but after 10+seconds.  I resolve this by
>> > restarting all bits, and then things are right as rain.
>> >
>>
>> Sounds like GC pressure, possibly.
>>
>> The GC logging that was mentioned in connection with drill.sh would be
>> helpful here.
>>
>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
So based on Paul's drillbit.sh comment and this, I decided to go ensure I
was enabling the proper GC logging, because I am skipping the drillbit.sh.
I looked at the drillbit.sh, and frankly, it looks like a goofy error may
be in it... the <FILE-PATH> seems to come from documentation for other
hadoop-ish projects, but I don't think Java or bash does anything with it.
Thus having that in the drillbit.sh (which to me shouldn't be changed)
seems to be a mistake. (Yes, the -Xloggc after it may just overwrite what
was passed in the <FILE-PATH>, but am I correct in saying that this is
actually just a mistake in the drillbit.sh, and all it does is add
confusion?) I hope I am wrong here and I get to learn something, but I just
can't see how <FILE-PATH> is interpreted by bash or Java....


John



if [ -n "$SERVER_GC_OPTS" ]; then

  export SERVER_GC_OPTS=${SERVER_GC_OPTS/"-Xloggc:<FILE-PATH>"/
"-Xloggc:${loggc}"}

fi

if [ -n "$CLIENT_GC_OPTS" ]; then

  export CLIENT_GC_OPTS=${CLIENT_GC_OPTS/"-Xloggc:<FILE-PATH>"/
"-Xloggc:${loggc}"}

fi

On Mon, May 30, 2016 at 3:42 AM, Ted Dunning <te...@gmail.com> wrote:

> On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com> wrote:
>
> > (It's a very weird situation that the bits get into,
> > everything hangs, somethings work, other things seem to be a in
> > an in-between between working and not working etc.  Like describe table
> > operations eventually return but after 10+seconds.  I resolve this by
> > restarting all bits, and then things are right as rain.
> >
>
> Sounds like GC pressure, possibly.
>
> The GC logging that was mentioned in connection with drill.sh would be
> helpful here.
>

Re: Reading and converting Parquet files intended for Impala

Posted by Ted Dunning <te...@gmail.com>.
On Sun, May 29, 2016 at 2:29 PM, John Omernik <jo...@omernik.com> wrote:

> (It's a very weird situation that the bits get into,
> everything hangs, somethings work, other things seem to be a in
> an in-between between working and not working etc.  Like describe table
> operations eventually return but after 10+seconds.  I resolve this by
> restarting all bits, and then things are right as rain.
>

Sounds like GC pressure, possibly.

The GC logging that was mentioned in connection with drill.sh would be
helpful here.

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
Ya, but based on my testing, if it's accurate to say that when
store.parquet.use_new_reader is false Drill is NOT using the custom reader,
then the bug is not occurring in the custom reader. When I do the min/max
with store.parquet.use_new_reader set to true, it actually returns the
values properly. My issue with store.parquet.use_new_reader = true is that
when I do a CTAS I get heap space issues (even at 24 GB) and nodes seem to
restart or something. (It's a very weird situation that the bits get into:
everything hangs, some things work, other things seem to be in an
in-between state between working and not working, etc. Describe-table
operations eventually return, but only after 10+ seconds. I resolve this by
restarting all bits, and then things are right as rain.)
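
For the record, the single-file test was essentially the following (the path
shown is the same placeholder as in the stack trace earlier in the thread):

-- fails with ArrayIndexOutOfBoundsException using the default reader
ALTER SESSION SET `store.parquet.use_new_reader` = false;
SELECT MIN(row_created_ts) FROM dfs.`/path/to/files/-m-00001.snappy.parquet`;

-- succeeds with the new (complex) reader
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT MIN(row_created_ts) FROM dfs.`/path/to/files/-m-00001.snappy.parquet`;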

That goes to my other question: is there a way to optimize my CTAS
statement with the new_reader set to true? I'd be OK with "an" option to do
my CTAS, even if I have to set the new reader and some memory options. (I
am sure folks in the Drill project will want to know what the bug is in the
normal reader, and I would still help with that regardless of whether I
have an option for my CTAS in my use case.)



On Sun, May 29, 2016 at 7:18 AM, Ted Dunning <te...@gmail.com> wrote:

> PARQUET-244 might have a similar bug in the custom Drill reader.
>
>
>
> On Sun, May 29, 2016 at 1:12 PM, John Omernik <jo...@omernik.com> wrote:
>
> > *sigh PARQUET-244 is likely not my issue considering that
> > DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101 type
> > stuff in front of a whole community, it's great for self esteem! :)
> >
> >
> >
> > On Sun, May 29, 2016 at 7:08 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > > Doing more research, I found this:
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-244
> > >
> > > So the version that is being written is 1.5-cdh, thus the writer does
> > have
> > > this bug. Question is A. Could we reproduce this in Drill 1.6 to see if
> > the
> > > default reader has the same error on know bad data, and B. Should Drill
> > be
> > > able to handle reading data created with this bug? (Note: The Parquet
> > > project seemed to implement handling of reading data created with this
> > bug (
> > > https://github.com/apache/parquet-mr/pull/235)  Note: I am not sure
> this
> > > is the same, thing I am seeing, I am just trying to find the things in
> > the
> > > Parquet project that seem close to what I am seeing)
> > >
> > > John
> > >
> > >
> > > On Sat, May 28, 2016 at 9:28 AM, John Omernik <jo...@omernik.com>
> wrote:
> > >
> > >> New Update
> > >>
> > >> Thank you Abdel for giving me an idea to try.
> > >>
> > >> When I was first doing the CTAS, I tried setting
> > store.parquet.use_new_reader
> > >> = true. What occurred when I did that, was Drill effectively "hung"  I
> > am
> > >> not sure why, perhaps Memory issues? (These are fairly beefy bits,
> 24GB
> > of
> > >> Heap, 84 GB of Direct Memory).
> > >>
> > >> But now that I've gotten further in troubleshooting, I have "one" bad
> > >> file, and so I tried the min(row_created_ts)  on the one bad file.
> With
> > the store.parquet.use_new_reader
> > >> set to false (the default) I get the Array Index Out of Bounds, but
> > when I
> > >> set to true, on the one file it now works.  So the "new" reader can
> > handle
> > >> the file, that's interesting. It still leaves me in a bit of a bind,
> > >> because setting the new reader to true on the CTAS doesn't actually
> work
> > >> (like I said, memory issues etc).  Any ideas on the new reader, and
> how
> > I
> > >> could get memory consumption down and actually have that succeed?  I
> am
> > not
> > >> doing any casting from the original Parquet files (would that help?)
> > All I
> > >> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string
> > fields
> > >> (because for some reason the Parquet string fields are read as binary
> in
> > >> Drill).  I was assuming (hoping?) that on a CTAS from Parquet to
> Parquet
> > >> that the types would be preserved, is that an incorrect assumption?
> > >>
> > >> Given this new piece of information, are there other steps I may want
> to
> > >> try/attempt?
> > >>
> > >> Thanks Abdel for the idea!
> > >>
> > >>
> > >>
> > >> On Sat, May 28, 2016 at 8:50 AM, John Omernik <jo...@omernik.com>
> wrote:
> > >>
> > >>> Thanks Ted, I summarized the problem to the Parquet Dev list.  At
> this
> > >>> point, and I hate that I have the restrictions on sharing the whole
> > file, I
> > >>> am just looking for new ways to troubleshoot the problem. I know the
> > MapR
> > >>> support team is scratching their heads on next steps as well. I did
> > offer
> > >>> to them, (and I offer to others who may want to look into the
> problem)
> > a
> > >>> screen share with me, even allowing control and in depth
> > troubleshooting.
> > >>> The cluster is not yet production, thus I can restart things change
> > debug
> > >>> settings, etc, and work with anyone who may be interested. (I know
> > it's not
> > >>> much to offer, a time consuming phone call to help someone else on a
> > >>> problem) but I do offer it. Any other ideas would also be welcome.
> > >>>
> > >>> John
> > >>>
> > >>>
> > >>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> The Parquet user/dev mailing list might be helpful here. They have a
> > >>>> real
> > >>>> stake in making sure that all readers/writers can work together. The
> > >>>> problem here really does sound like there is a borderline case that
> > >>>> isn't
> > >>>> handled as well in the Drill special purpose parquet reader as in
> the
> > >>>> normal readers.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com>
> > wrote:
> > >>>>
> > >>>> > So working with MapR support we tried that with Impala, but it
> > didn't
> > >>>> > produce the desired results because the outputfile worked fine in
> > >>>> Drill.
> > >>>> > Theory: Evil file is created in Mapr Reduce, and is using a
> > different
> > >>>> > writer than Impala is using. Impala can read the evil file, but
> when
> > >>>> it
> > >>>> > writes it uses it's own writer, "fixing" the issue on the fly.
> > Thus,
> > >>>> Drill
> > >>>> > can't read evil file, but if we try to reduce with Impala, files
> is
> > no
> > >>>> > longer evil, consider it... chaotic neutral ... (For all you D&D
> > fans
> > >>>> )
> > >>>> >
> > >>>> > I'd ideally love to extract into badness, but on the phone now
> with
> > >>>> MapR
> > >>>> > support to figure out HOW, that is the question at hand.
> > >>>> >
> > >>>> > John
> > >>>> >
> > >>>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <
> > ted.dunning@gmail.com>
> > >>>> > wrote:
> > >>>> >
> > >>>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <john@omernik.com
> >
> > >>>> wrote:
> > >>>> > >
> > >>>> > > > So, if we have a known "bad" Parquet file (I use quotes,
> because
> > >>>> > > remember,
> > >>>> > > > Impala queries this file just fine) created in Map Reduce,
> with
> > a
> > >>>> > column
> > >>>> > > > causing Array Index Out of Bounds problems with a BIGINT typed
> > >>>> column.
> > >>>> > > What
> > >>>> > > > would your next steps be to troubleshoot?
> > >>>> > > >
> > >>>> > >
> > >>>> > > I would start reducing the size of the evil file.
> > >>>> > >
> > >>>> > > If you have a tool that can query the bad parquet and write a
> new
> > >>>> one
> > >>>> > > (sounds like Impala might do here) then selecting just the evil
> > >>>> column
> > >>>> > is a
> > >>>> > > good first step.
> > >>>> > >
> > >>>> > > After that, I would start bisecting to find a small range that
> > still
> > >>>> > causes
> > >>>> > > the problem. There may not be such, but it is good thing to try.
> > >>>> > >
> > >>>> > > At that point, you could easily have the problem down to a few
> > >>>> kilobytes
> > >>>> > of
> > >>>> > > data that can be used in a unit test.
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>

Re: Reading and converting Parquet files intended for Impala

Posted by Ted Dunning <te...@gmail.com>.
PARQUET-244 might have a similar bug in the custom Drill reader.



On Sun, May 29, 2016 at 1:12 PM, John Omernik <jo...@omernik.com> wrote:

> *sigh PARQUET-244 is likely not my issue considering that
> DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101 type
> stuff in front of a whole community, it's great for self esteem! :)
>
>
>
> On Sun, May 29, 2016 at 7:08 AM, John Omernik <jo...@omernik.com> wrote:
>
> > Doing more research, I found this:
> >
> > https://issues.apache.org/jira/browse/PARQUET-244
> >
> > So the version that is being written is 1.5-cdh, thus the writer does
> have
> > this bug. Question is A. Could we reproduce this in Drill 1.6 to see if
> the
> > default reader has the same error on know bad data, and B. Should Drill
> be
> > able to handle reading data created with this bug? (Note: The Parquet
> > project seemed to implement handling of reading data created with this
> bug (
> > https://github.com/apache/parquet-mr/pull/235)  Note: I am not sure this
> > is the same, thing I am seeing, I am just trying to find the things in
> the
> > Parquet project that seem close to what I am seeing)
> >
> > John
> >
> >
> > On Sat, May 28, 2016 at 9:28 AM, John Omernik <jo...@omernik.com> wrote:
> >
> >> New Update
> >>
> >> Thank you Abdel for giving me an idea to try.
> >>
> >> When I was first doing the CTAS, I tried setting
> store.parquet.use_new_reader
> >> = true. What occurred when I did that, was Drill effectively "hung"  I
> am
> >> not sure why, perhaps Memory issues? (These are fairly beefy bits, 24GB
> of
> >> Heap, 84 GB of Direct Memory).
> >>
> >> But now that I've gotten further in troubleshooting, I have "one" bad
> >> file, and so I tried the min(row_created_ts)  on the one bad file. With
> the store.parquet.use_new_reader
> >> set to false (the default) I get the Array Index Out of Bounds, but
> when I
> >> set to true, on the one file it now works.  So the "new" reader can
> handle
> >> the file, that's interesting. It still leaves me in a bit of a bind,
> >> because setting the new reader to true on the CTAS doesn't actually work
> >> (like I said, memory issues etc).  Any ideas on the new reader, and how
> I
> >> could get memory consumption down and actually have that succeed?  I am
> not
> >> doing any casting from the original Parquet files (would that help?)
> All I
> >> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string
> fields
> >> (because for some reason the Parquet string fields are read as binary in
> >> Drill).  I was assuming (hoping?) that on a CTAS from Parquet to Parquet
> >> that the types would be preserved, is that an incorrect assumption?
> >>
> >> Given this new piece of information, are there other steps I may want to
> >> try/attempt?
> >>
> >> Thanks Abdel for the idea!
> >>
> >>
> >>
> >> On Sat, May 28, 2016 at 8:50 AM, John Omernik <jo...@omernik.com> wrote:
> >>
> >>> Thanks Ted, I summarized the problem to the Parquet Dev list.  At this
> >>> point, and I hate that I have the restrictions on sharing the whole
> file, I
> >>> am just looking for new ways to troubleshoot the problem. I know the
> MapR
> >>> support team is scratching their heads on next steps as well. I did
> offer
> >>> to them, (and I offer to others who may want to look into the problem)
> a
> >>> screen share with me, even allowing control and in depth
> troubleshooting.
> >>> The cluster is not yet production, thus I can restart things change
> debug
> >>> settings, etc, and work with anyone who may be interested. (I know
> it's not
> >>> much to offer, a time consuming phone call to help someone else on a
> >>> problem) but I do offer it. Any other ideas would also be welcome.
> >>>
> >>> John
> >>>
> >>>
> >>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>> The Parquet user/dev mailing list might be helpful here. They have a
> >>>> real
> >>>> stake in making sure that all readers/writers can work together. The
> >>>> problem here really does sound like there is a borderline case that
> >>>> isn't
> >>>> handled as well in the Drill special purpose parquet reader as in the
> >>>> normal readers.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com>
> wrote:
> >>>>
> >>>> > So working with MapR support we tried that with Impala, but it
> didn't
> >>>> > produce the desired results because the outputfile worked fine in
> >>>> Drill.
> >>>> > Theory: Evil file is created in Mapr Reduce, and is using a
> different
> >>>> > writer than Impala is using. Impala can read the evil file, but when
> >>>> it
> >>>> > writes it uses it's own writer, "fixing" the issue on the fly.
> Thus,
> >>>> Drill
> >>>> > can't read evil file, but if we try to reduce with Impala, files is
> no
> >>>> > longer evil, consider it... chaotic neutral ... (For all you D&D
> fans
> >>>> )
> >>>> >
> >>>> > I'd ideally love to extract into badness, but on the phone now with
> >>>> MapR
> >>>> > support to figure out HOW, that is the question at hand.
> >>>> >
> >>>> > John
> >>>> >
> >>>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <
> ted.dunning@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com>
> >>>> wrote:
> >>>> > >
> >>>> > > > So, if we have a known "bad" Parquet file (I use quotes, because
> >>>> > > remember,
> >>>> > > > Impala queries this file just fine) created in Map Reduce, with
> a
> >>>> > column
> >>>> > > > causing Array Index Out of Bounds problems with a BIGINT typed
> >>>> column.
> >>>> > > What
> >>>> > > > would your next steps be to troubleshoot?
> >>>> > > >
> >>>> > >
> >>>> > > I would start reducing the size of the evil file.
> >>>> > >
> >>>> > > If you have a tool that can query the bad parquet and write a new
> >>>> one
> >>>> > > (sounds like Impala might do here) then selecting just the evil
> >>>> column
> >>>> > is a
> >>>> > > good first step.
> >>>> > >
> >>>> > > After that, I would start bisecting to find a small range that
> still
> >>>> > causes
> >>>> > > the problem. There may not be such, but it is good thing to try.
> >>>> > >
> >>>> > > At that point, you could easily have the problem down to a few
> >>>> kilobytes
> >>>> > of
> >>>> > > data that can be used in a unit test.
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
*sigh PARQUET-244 is likely not my issue considering that
DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101 type
stuff in front of a whole community, it's great for self esteem! :)



On Sun, May 29, 2016 at 7:08 AM, John Omernik <jo...@omernik.com> wrote:

> Doing more research, I found this:
>
> https://issues.apache.org/jira/browse/PARQUET-244
>
> So the version that is being written is 1.5-cdh, thus the writer does have
> this bug. Question is A. Could we reproduce this in Drill 1.6 to see if the
> default reader has the same error on know bad data, and B. Should Drill be
> able to handle reading data created with this bug? (Note: The Parquet
> project seemed to implement handling of reading data created with this bug (
> https://github.com/apache/parquet-mr/pull/235)  Note: I am not sure this
> is the same, thing I am seeing, I am just trying to find the things in the
> Parquet project that seem close to what I am seeing)
>
> John
>
>
> On Sat, May 28, 2016 at 9:28 AM, John Omernik <jo...@omernik.com> wrote:
>
>> New Update
>>
>> Thank you Abdel for giving me an idea to try.
>>
>> When I was first doing the CTAS, I tried setting store.parquet.use_new_reader
>> = true. What occurred when I did that, was Drill effectively "hung"  I am
>> not sure why, perhaps Memory issues? (These are fairly beefy bits, 24GB of
>> Heap, 84 GB of Direct Memory).
>>
>> But now that I've gotten further in troubleshooting, I have "one" bad
>> file, and so I tried the min(row_created_ts)  on the one bad file. With the store.parquet.use_new_reader
>> set to false (the default) I get the Array Index Out of Bounds, but when I
>> set to true, on the one file it now works.  So the "new" reader can handle
>> the file, that's interesting. It still leaves me in a bit of a bind,
>> because setting the new reader to true on the CTAS doesn't actually work
>> (like I said, memory issues etc).  Any ideas on the new reader, and how I
>> could get memory consumption down and actually have that succeed?  I am not
>> doing any casting from the original Parquet files (would that help?) All I
>> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields
>> (because for some reason the Parquet string fields are read as binary in
>> Drill).  I was assuming (hoping?) that on a CTAS from Parquet to Parquet
>> that the types would be preserved, is that an incorrect assumption?
>>
>> Given this new piece of information, are there other steps I may want to
>> try/attempt?
>>
>> Thanks Abdel for the idea!
>>
>>
>>
>> On Sat, May 28, 2016 at 8:50 AM, John Omernik <jo...@omernik.com> wrote:
>>
>>> Thanks Ted, I summarized the problem to the Parquet Dev list.  At this
>>> point, and I hate that I have the restrictions on sharing the whole file, I
>>> am just looking for new ways to troubleshoot the problem. I know the MapR
>>> support team is scratching their heads on next steps as well. I did offer
>>> to them, (and I offer to others who may want to look into the problem) a
>>> screen share with me, even allowing control and in depth troubleshooting.
>>> The cluster is not yet production, thus I can restart things change debug
>>> settings, etc, and work with anyone who may be interested. (I know it's not
>>> much to offer, a time consuming phone call to help someone else on a
>>> problem) but I do offer it. Any other ideas would also be welcome.
>>>
>>> John
>>>
>>>
>>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>> The Parquet user/dev mailing list might be helpful here. They have a
>>>> real
>>>> stake in making sure that all readers/writers can work together. The
>>>> problem here really does sound like there is a borderline case that
>>>> isn't
>>>> handled as well in the Drill special purpose parquet reader as in the
>>>> normal readers.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com> wrote:
>>>>
>>>> > So working with MapR support we tried that with Impala, but it didn't
>>>> > produce the desired results because the outputfile worked fine in
>>>> Drill.
>>>> > Theory: Evil file is created in Mapr Reduce, and is using a different
>>>> > writer than Impala is using. Impala can read the evil file, but when
>>>> it
>>>> > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
>>>> Drill
>>>> > can't read evil file, but if we try to reduce with Impala, files is no
>>>> > longer evil, consider it... chaotic neutral ... (For all you D&D fans
>>>> )
>>>> >
>>>> > I'd ideally love to extract into badness, but on the phone now with
>>>> MapR
>>>> > support to figure out HOW, that is the question at hand.
>>>> >
>>>> > John
>>>> >
>>>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com>
>>>> wrote:
>>>> > >
>>>> > > > So, if we have a known "bad" Parquet file (I use quotes, because
>>>> > > remember,
>>>> > > > Impala queries this file just fine) created in Map Reduce, with a
>>>> > column
>>>> > > > causing Array Index Out of Bounds problems with a BIGINT typed
>>>> column.
>>>> > > What
>>>> > > > would your next steps be to troubleshoot?
>>>> > > >
>>>> > >
>>>> > > I would start reducing the size of the evil file.
>>>> > >
>>>> > > If you have a tool that can query the bad parquet and write a new
>>>> one
>>>> > > (sounds like Impala might do here) then selecting just the evil
>>>> column
>>>> > is a
>>>> > > good first step.
>>>> > >
>>>> > > After that, I would start bisecting to find a small range that still
>>>> > causes
>>>> > > the problem. There may not be such, but it is good thing to try.
>>>> > >
>>>> > > At that point, you could easily have the problem down to a few
>>>> kilobytes
>>>> > of
>>>> > > data that can be used in a unit test.
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
Doing more research, I found this:

https://issues.apache.org/jira/browse/PARQUET-244

So the version that is being written is 1.5-cdh, thus the writer does have
this bug. The questions are: A. Could we reproduce this in Drill 1.6 to see
if the default reader has the same error on known bad data? And B. Should
Drill be able to handle reading data created with this bug? (Note: the
Parquet project seems to have implemented handling for reading data created
with this bug (https://github.com/apache/parquet-mr/pull/235). I am not sure
this is the same thing I am seeing; I am just trying to find the things in
the Parquet project that seem close to what I am seeing.)

John


On Sat, May 28, 2016 at 9:28 AM, John Omernik <jo...@omernik.com> wrote:

> New Update
>
> Thank you Abdel for giving me an idea to try.
>
> When I was first doing the CTAS, I tried setting store.parquet.use_new_reader
> = true. What occurred when I did that, was Drill effectively "hung"  I am
> not sure why, perhaps Memory issues? (These are fairly beefy bits, 24GB of
> Heap, 84 GB of Direct Memory).
>
> But now that I've gotten further in troubleshooting, I have "one" bad
> file, and so I tried the min(row_created_ts)  on the one bad file. With the store.parquet.use_new_reader
> set to false (the default) I get the Array Index Out of Bounds, but when I
> set to true, on the one file it now works.  So the "new" reader can handle
> the file, that's interesting. It still leaves me in a bit of a bind,
> because setting the new reader to true on the CTAS doesn't actually work
> (like I said, memory issues etc).  Any ideas on the new reader, and how I
> could get memory consumption down and actually have that succeed?  I am not
> doing any casting from the original Parquet files (would that help?) All I
> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields
> (because for some reason the Parquet string fields are read as binary in
> Drill).  I was assuming (hoping?) that on a CTAS from Parquet to Parquet
> that the types would be preserved, is that an incorrect assumption?
>
> Given this new piece of information, are there other steps I may want to
> try/attempt?
>
> Thanks Abdel for the idea!
>
>
>
> On Sat, May 28, 2016 at 8:50 AM, John Omernik <jo...@omernik.com> wrote:
>
>> Thanks Ted, I summarized the problem to the Parquet Dev list.  At this
>> point, and I hate that I have the restrictions on sharing the whole file, I
>> am just looking for new ways to troubleshoot the problem. I know the MapR
>> support team is scratching their heads on next steps as well. I did offer
>> to them, (and I offer to others who may want to look into the problem) a
>> screen share with me, even allowing control and in depth troubleshooting.
>> The cluster is not yet production, thus I can restart things change debug
>> settings, etc, and work with anyone who may be interested. (I know it's not
>> much to offer, a time consuming phone call to help someone else on a
>> problem) but I do offer it. Any other ideas would also be welcome.
>>
>> John
>>
>>
>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>> The Parquet user/dev mailing list might be helpful here. They have a real
>>> stake in making sure that all readers/writers can work together. The
>>> problem here really does sound like there is a borderline case that isn't
>>> handled as well in the Drill special purpose parquet reader as in the
>>> normal readers.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com> wrote:
>>>
>>> > So working with MapR support we tried that with Impala, but it didn't
>>> > produce the desired results because the outputfile worked fine in
>>> Drill.
>>> > Theory: Evil file is created in Mapr Reduce, and is using a different
>>> > writer than Impala is using. Impala can read the evil file, but when it
>>> > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
>>> Drill
>>> > can't read evil file, but if we try to reduce with Impala, files is no
>>> > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
>>> >
>>> > I'd ideally love to extract into badness, but on the phone now with
>>> MapR
>>> > support to figure out HOW, that is the question at hand.
>>> >
>>> > John
>>> >
>>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com>
>>> > wrote:
>>> >
>>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com>
>>> wrote:
>>> > >
>>> > > > So, if we have a known "bad" Parquet file (I use quotes, because
>>> > > remember,
>>> > > > Impala queries this file just fine) created in Map Reduce, with a
>>> > column
>>> > > > causing Array Index Out of Bounds problems with a BIGINT typed
>>> column.
>>> > > What
>>> > > > would your next steps be to troubleshoot?
>>> > > >
>>> > >
>>> > > I would start reducing the size of the evil file.
>>> > >
>>> > > If you have a tool that can query the bad parquet and write a new one
>>> > > (sounds like Impala might do here) then selecting just the evil
>>> column
>>> > is a
>>> > > good first step.
>>> > >
>>> > > After that, I would start bisecting to find a small range that still
>>> > causes
>>> > > the problem. There may not be such, but it is good thing to try.
>>> > >
>>> > > At that point, you could easily have the problem down to a few
>>> kilobytes
>>> > of
>>> > > data that can be used in a unit test.
>>> > >
>>> >
>>>
>>
>>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
New Update

Thank you Abdel for giving me an idea to try.

When I was first doing the CTAS, I tried setting store.parquet.use_new_reader
= true. What occurred when I did that was that Drill effectively "hung". I
am not sure why; perhaps memory issues? (These are fairly beefy drillbits:
24 GB of heap, 84 GB of direct memory.)

But now that I've gotten further in troubleshooting, I have "one" bad file,
and so I tried the min(row_created_ts) on that one bad file. With
store.parquet.use_new_reader set to false (the default) I get the Array
Index Out of Bounds error, but when I set it to true, the query on that one
file now works. So the "new" reader can handle the file, which is
interesting. It still leaves me in a bit of a bind, because setting the new
reader to true on the CTAS doesn't actually work (like I said, memory
issues, etc.). Any ideas on the new reader, and how I could get memory
consumption down so the CTAS actually succeeds? I am not doing any casting
from the original Parquet files (would that help?). All I am doing in my
view is CONVERT_FROM(field, 'UTF8') on the string fields (because for some
reason the Parquet string fields are read as binary in Drill). I was
assuming (hoping?) that on a CTAS from Parquet to Parquet the types would be
preserved; is that an incorrect assumption?
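
For reference, a rough sketch of the shape of what I am running (the paths,
view name, and field names below are placeholders, not my real schema):

  ALTER SESSION SET `store.parquet.use_new_reader` = true;

  -- probe just the one suspect file (placeholder path)
  SELECT MIN(row_created_ts)
  FROM dfs.`/path/to/suspect/file.snappy.parquet`;

  -- the view only wraps the string columns in CONVERT_FROM, no casting
  CREATE OR REPLACE VIEW dfs.tmp.`events_view` AS
  SELECT CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,
         row_created_ts
  FROM dfs.`/path/to/raw/parquet/dir`;

  -- and the CTAS just selects from that view
  CREATE TABLE dfs.tmp.`events_converted` AS
  SELECT * FROM dfs.tmp.`events_view`;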

Given this new piece of information, are there other steps I may want to
try/attempt?

Thanks Abdel for the idea!



On Sat, May 28, 2016 at 8:50 AM, John Omernik <jo...@omernik.com> wrote:

> Thanks Ted, I summarized the problem to the Parquet Dev list.  At this
> point, and I hate that I have the restrictions on sharing the whole file, I
> am just looking for new ways to troubleshoot the problem. I know the MapR
> support team is scratching their heads on next steps as well. I did offer
> to them, (and I offer to others who may want to look into the problem) a
> screen share with me, even allowing control and in depth troubleshooting.
> The cluster is not yet production, thus I can restart things change debug
> settings, etc, and work with anyone who may be interested. (I know it's not
> much to offer, a time consuming phone call to help someone else on a
> problem) but I do offer it. Any other ideas would also be welcome.
>
> John
>
>
> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
>> The Parquet user/dev mailing list might be helpful here. They have a real
>> stake in making sure that all readers/writers can work together. The
>> problem here really does sound like there is a borderline case that isn't
>> handled as well in the Drill special purpose parquet reader as in the
>> normal readers.
>>
>>
>>
>>
>>
>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com> wrote:
>>
>> > So working with MapR support we tried that with Impala, but it didn't
>> > produce the desired results because the outputfile worked fine in Drill.
>> > Theory: Evil file is created in Mapr Reduce, and is using a different
>> > writer than Impala is using. Impala can read the evil file, but when it
>> > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
>> Drill
>> > can't read evil file, but if we try to reduce with Impala, files is no
>> > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
>> >
>> > I'd ideally love to extract into badness, but on the phone now with MapR
>> > support to figure out HOW, that is the question at hand.
>> >
>> > John
>> >
>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com>
>> wrote:
>> > >
>> > > > So, if we have a known "bad" Parquet file (I use quotes, because
>> > > remember,
>> > > > Impala queries this file just fine) created in Map Reduce, with a
>> > column
>> > > > causing Array Index Out of Bounds problems with a BIGINT typed
>> column.
>> > > What
>> > > > would your next steps be to troubleshoot?
>> > > >
>> > >
>> > > I would start reducing the size of the evil file.
>> > >
>> > > If you have a tool that can query the bad parquet and write a new one
>> > > (sounds like Impala might do here) then selecting just the evil column
>> > is a
>> > > good first step.
>> > >
>> > > After that, I would start bisecting to find a small range that still
>> > causes
>> > > the problem. There may not be such, but it is good thing to try.
>> > >
>> > > At that point, you could easily have the problem down to a few
>> kilobytes
>> > of
>> > > data that can be used in a unit test.
>> > >
>> >
>>
>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
Thanks Ted, I summarized the problem to the Parquet dev list. At this point,
and I hate that I have restrictions on sharing the whole file, I am just
looking for new ways to troubleshoot the problem. I know the MapR support
team is scratching their heads on next steps as well. I did offer to them
(and I offer it to others who may want to look into the problem) a screen
share with me, even allowing control and in-depth troubleshooting. The
cluster is not yet in production, so I can restart things, change debug
settings, etc., and work with anyone who may be interested. (I know it's not
much to offer: a time-consuming phone call to help someone else with a
problem.) But I do offer it. Any other ideas would also be welcome.

John


On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <te...@gmail.com> wrote:

> The Parquet user/dev mailing list might be helpful here. They have a real
> stake in making sure that all readers/writers can work together. The
> problem here really does sound like there is a borderline case that isn't
> handled as well in the Drill special purpose parquet reader as in the
> normal readers.
>
>
>
>
>
> On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com> wrote:
>
> > So working with MapR support we tried that with Impala, but it didn't
> > produce the desired results because the outputfile worked fine in Drill.
> > Theory: Evil file is created in Mapr Reduce, and is using a different
> > writer than Impala is using. Impala can read the evil file, but when it
> > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
> Drill
> > can't read evil file, but if we try to reduce with Impala, files is no
> > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
> >
> > I'd ideally love to extract into badness, but on the phone now with MapR
> > support to figure out HOW, that is the question at hand.
> >
> > John
> >
> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com>
> wrote:
> > >
> > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > remember,
> > > > Impala queries this file just fine) created in Map Reduce, with a
> > column
> > > > causing Array Index Out of Bounds problems with a BIGINT typed
> column.
> > > What
> > > > would your next steps be to troubleshoot?
> > > >
> > >
> > > I would start reducing the size of the evil file.
> > >
> > > If you have a tool that can query the bad parquet and write a new one
> > > (sounds like Impala might do here) then selecting just the evil column
> > is a
> > > good first step.
> > >
> > > After that, I would start bisecting to find a small range that still
> > causes
> > > the problem. There may not be such, but it is good thing to try.
> > >
> > > At that point, you could easily have the problem down to a few
> kilobytes
> > of
> > > data that can be used in a unit test.
> > >
> >
>

Re: Reading and converting Parquet files intended for Impala

Posted by Ted Dunning <te...@gmail.com>.
The Parquet user/dev mailing list might be helpful here. They have a real
stake in making sure that all readers/writers can work together. The
problem here really does sound like there is a borderline case that isn't
handled as well in Drill's special-purpose Parquet reader as in the normal
readers.





On Fri, May 27, 2016 at 7:23 PM, John Omernik <jo...@omernik.com> wrote:

> So working with MapR support we tried that with Impala, but it didn't
> produce the desired results because the outputfile worked fine in Drill.
> Theory: Evil file is created in Mapr Reduce, and is using a different
> writer than Impala is using. Impala can read the evil file, but when it
> writes it uses it's own writer, "fixing" the issue on the fly.  Thus, Drill
> can't read evil file, but if we try to reduce with Impala, files is no
> longer evil, consider it... chaotic neutral ... (For all you D&D fans )
>
> I'd ideally love to extract into badness, but on the phone now with MapR
> support to figure out HOW, that is the question at hand.
>
> John
>
> On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com> wrote:
> >
> > > So, if we have a known "bad" Parquet file (I use quotes, because
> > remember,
> > > Impala queries this file just fine) created in Map Reduce, with a
> column
> > > causing Array Index Out of Bounds problems with a BIGINT typed column.
> > What
> > > would your next steps be to troubleshoot?
> > >
> >
> > I would start reducing the size of the evil file.
> >
> > If you have a tool that can query the bad parquet and write a new one
> > (sounds like Impala might do here) then selecting just the evil column
> is a
> > good first step.
> >
> > After that, I would start bisecting to find a small range that still
> causes
> > the problem. There may not be such, but it is good thing to try.
> >
> > At that point, you could easily have the problem down to a few kilobytes
> of
> > data that can be used in a unit test.
> >
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
So, working with MapR support, we tried that with Impala, but it didn't
produce the desired results because the output file worked fine in Drill.
Theory: the evil file is created in MapReduce, which uses a different
writer than Impala does. Impala can read the evil file, but when it
writes it uses its own writer, "fixing" the issue on the fly. Thus, Drill
can't read the evil file, but if we try to reduce it with Impala, the file
is no longer evil; consider it... chaotic neutral... (for all you D&D fans).

I'd ideally love to extract just the badness, but I'm on the phone now with
MapR support to figure out HOW; that is the question at hand.

John

On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <te...@gmail.com> wrote:

> On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com> wrote:
>
> > So, if we have a known "bad" Parquet file (I use quotes, because
> remember,
> > Impala queries this file just fine) created in Map Reduce, with a column
> > causing Array Index Out of Bounds problems with a BIGINT typed column.
> What
> > would your next steps be to troubleshoot?
> >
>
> I would start reducing the size of the evil file.
>
> If you have a tool that can query the bad parquet and write a new one
> (sounds like Impala might do here) then selecting just the evil column is a
> good first step.
>
> After that, I would start bisecting to find a small range that still causes
> the problem. There may not be such, but it is good thing to try.
>
> At that point, you could easily have the problem down to a few kilobytes of
> data that can be used in a unit test.
>

Re: Reading and converting Parquet files intended for Impala

Posted by Ted Dunning <te...@gmail.com>.
On Thu, May 26, 2016 at 8:50 PM, John Omernik <jo...@omernik.com> wrote:

> So, if we have a known "bad" Parquet file (I use quotes, because remember,
> Impala queries this file just fine) created in Map Reduce, with a column
> causing Array Index Out of Bounds problems with a BIGINT typed column. What
> would your next steps be to troubleshoot?
>

I would start reducing the size of the evil file.

If you have a tool that can query the bad Parquet file and write a new one
(sounds like Impala might do the job here), then selecting just the evil
column is a good first step.

After that, I would start bisecting to find a small range that still causes
the problem. There may not be such a range, but it is a good thing to try.

At that point, you could easily have the problem down to a few kilobytes of
data that can be used in a unit test.
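
A rough sketch of what I mean, assuming Impala is the rewriting tool (the
table and column names here are placeholders):

  -- rewrite just the suspect column into a new Parquet table
  CREATE TABLE evil_column_only STORED AS PARQUET
  AS SELECT bad_field FROM events_table;

  -- then bisect: keep narrowing the range until the failure disappears
  -- (or you find that no small range reproduces it)
  CREATE TABLE evil_column_slice STORED AS PARQUET
  AS SELECT bad_field FROM events_table
  WHERE bad_field BETWEEN 0 AND 1000000;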

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
The MapR support folks gave me a good idea to troubleshoot: can I hone in on
which columns are the problem? Basically I have nearly 100 fields in this
table, and the hunch was that only a few fields may be at issue. I took this
idea and wrote a Python script that took the field list and, using the REST
API, would do the CTAS of a known bad day of data. It would fail, and I would
record that, as well as the file that was failing. (For some reason I
couldn't get a CTAS on specified files to fail, only when they were all
queried together.) Every iteration, I would take the last field off and try
the CTAS again. Eventually, I found the field. It was a BIGINT field that we
will call bad_field. Now, what if I did a SELECT min(bad_field),
max(bad_field) FROM `path/to/knownbad`? Boom, that would fail as well with
the same array-out-of-bounds error. Cool. What if I did the CTAS without
that field? Boom, that worked. (We need a JIRA filed to get me to stop
saying boom.)
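
Roughly, the two probes that isolated it look like this (the workspace and
surviving column names are placeholders):

  -- fails with the same ArrayIndexOutOfBoundsException on the bad file
  SELECT MIN(bad_field), MAX(bad_field)
  FROM dfs.`path/to/knownbad`;

  -- succeeds once the suspect BIGINT column is left out of the CTAS
  CREATE TABLE dfs.tmp.`ctas_without_bad_field` AS
  SELECT field_1, field_2  -- ...and every other column except bad_field
  FROM dfs.`path/to/knownbad`;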

Ok, I think I am on to something here. Next step: could I make the min/max
query fail when querying ONLY that file? Yes! Ok, we are getting close. This
is great, because now instead of 120 GB of data, I can look at 240 MB of
data. Now, the same min/max in Impala works fine, and I am unsure what to
look at next. I will be doing a WebEx with MapR support tomorrow, but I
thought I'd multi-thread this too, mainly because if someone is having a
similar problem, I want to keep what I am doing to solve the problem out in
the open.

So, we have a known "bad" Parquet file (I use quotes because, remember,
Impala queries this file just fine) created in MapReduce, with a
BIGINT-typed column causing Array Index Out of Bounds problems. What would
your next steps be to troubleshoot?



On Mon, May 23, 2016 at 4:16 PM, John Omernik <jo...@omernik.com> wrote:

> Troubleshooting this is made more difficult by the fact that the file that
> gives the error works fine when I select directly from it into a new
> table... this makes it very tricky to troubleshoot, any assistance on this
> would be appreciated, I've opened a ticket with MapR as well, but I am
> stumped, and this is our primary use case right now, thus this is a
> blocker. (Note I've tried three different days, two fail, one works)
>
> John
>
> On Mon, May 23, 2016 at 9:48 AM, John Omernik <jo...@omernik.com> wrote:
>
>> I have a largish directory of parquet files generated for use in Impala.
>> They were created with the CDH version of apache-parquet-mr (not sure on
>> version at this time)
>>
>> Some settings:
>> Compression: snappy
>> Use Dictionary: true
>> WRITER_VERION: PARQUET_1_0
>>
>> I can read them as is in Drill, however, the strings all come through as
>> binary (see other thread). I can cast all those fields as VARCHAR and read
>> them but take a bad performance hit (2 seconds to read directly from raw
>> parquet, limit 10, but showing binary.  25 seconds to use a view that CASTS
>> all fields into the proper types... data returns accurately, but 10 rows
>> taking 25 seconds is too long)
>>
>> So I want to read from this directory (approx 126GB) and CTAS in a way
>> Drill will be happier.
>>
>> I've tried this two ways. One was just to ctas directly from view I
>> created. All else being default. The other was to set the reader
>> "new_reader" = true. Neither worked, and new_reader actually behaves very
>> badly (need to restart drill bits)  At least the other default reader
>> errors :)
>>
>> store.parquet.use_new_reader = false (the default)
>> This through the error below (it's a truncated error, lots of fireld
>> names and other things.  It stored 6 GB of files and died.
>>
>> store.parquet.use_new_reader = true
>>
>> 1.4 GB of files created and  everything hangs, need to restart drillbits
>> (is this an issue?)
>>
>>
>>
>> Error from "non" new_reader:
>>
>> rror: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>>
>>
>>
>> Fragment 1:36
>>
>>
>>
>> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
>> atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>>
>>
>>
>>   (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
>> parquet record reader.
>>
>> Message:
>>
>> Hadoop path: /path/to/files/-m-00001.snappy.parquet
>>
>> Total records read: 393120
>>
>> Mock records read: 0
>>
>> Records to read: 32768
>>
>> Row group index: 0
>>
>> Records in row group: 536499
>>
>> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
>>
>> …
>>
>>
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
>>
>>     org.apache.drill.exec.physical.impl.ScanBatch.next():191
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>
>>
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>
>>
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>
>>
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>>
>>     java.security.AccessController.doPrivileged():-2
>>
>>     javax.security.auth.Subject.doAs():422
>>
>>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>>
>>     org.apache.drill.common.SelfCleaningRunnable.run():38
>>
>>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>
>>     java.lang.Thread.run():745
>>
>>   Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
>>
>>
>> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
>>
>>
>> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
>>
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
>>
>>     org.apache.drill.exec.physical.impl.ScanBatch.next():191
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>
>>
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>
>>
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>
>>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>>
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>
>>
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>>
>>     java.security.AccessController.doPrivileged():-2
>>
>>     javax.security.auth.Subject.doAs():422
>>
>>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>>
>>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>>
>>     org.apache.drill.common.SelfCleaningRunnable.run():38
>>
>>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>
>>     java.lang.Thread.run():745 (state=,code=0)
>>
>
>

Re: Reading and converting Parquet files intended for Impala

Posted by John Omernik <jo...@omernik.com>.
Troubleshooting this is made more difficult by the fact that the file that
gives the error works fine when I select directly from it into a new
table... this makes it very tricky to troubleshoot. Any assistance on this
would be appreciated. I've opened a ticket with MapR as well, but I am
stumped, and this is our primary use case right now, so this is a blocker.
(Note: I've tried three different days of data; two fail, one works.)
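
For what it's worth, a sketch of the kind of single-file CTAS that succeeds
(the path and target name are placeholders):

  CREATE TABLE dfs.tmp.`one_bad_file_copy` AS
  SELECT * FROM dfs.`/path/to/files/one-failing-file.snappy.parquet`;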

John

On Mon, May 23, 2016 at 9:48 AM, John Omernik <jo...@omernik.com> wrote:

> I have a largish directory of parquet files generated for use in Impala.
> They were created with the CDH version of apache-parquet-mr (not sure on
> version at this time)
>
> Some settings:
> Compression: snappy
> Use Dictionary: true
> WRITER_VERION: PARQUET_1_0
>
> I can read them as is in Drill, however, the strings all come through as
> binary (see other thread). I can cast all those fields as VARCHAR and read
> them but take a bad performance hit (2 seconds to read directly from raw
> parquet, limit 10, but showing binary.  25 seconds to use a view that CASTS
> all fields into the proper types... data returns accurately, but 10 rows
> taking 25 seconds is too long)
>
> So I want to read from this directory (approx 126GB) and CTAS in a way
> Drill will be happier.
>
> I've tried this two ways. One was just to ctas directly from view I
> created. All else being default. The other was to set the reader
> "new_reader" = true. Neither worked, and new_reader actually behaves very
> badly (need to restart drill bits)  At least the other default reader
> errors :)
>
> store.parquet.use_new_reader = false (the default)
> This through the error below (it's a truncated error, lots of fireld names
> and other things.  It stored 6 GB of files and died.
>
> store.parquet.use_new_reader = true
>
> 1.4 GB of files created and  everything hangs, need to restart drillbits
> (is this an issue?)
>
>
>
> Error from "non" new_reader:
>
> rror: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>
>
>
> Fragment 1:36
>
>
>
> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
> atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>
>
>
>   (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
> parquet record reader.
>
> Message:
>
> Hadoop path: /path/to/files/-m-00001.snappy.parquet
>
> Total records read: 393120
>
> Mock records read: 0
>
> Records to read: 32768
>
> Row group index: 0
>
> Records in row group: 536499
>
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
>
> …
>
>
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
>
>     org.apache.drill.exec.physical.impl.ScanBatch.next():191
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
>
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>
>     java.security.AccessController.doPrivileged():-2
>
>     javax.security.auth.Subject.doAs():422
>
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>
>     java.lang.Thread.run():745
>
>   Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
>
>
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
>
>
> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
>
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
>
>     org.apache.drill.exec.physical.impl.ScanBatch.next():191
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
>
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>
>     java.security.AccessController.doPrivileged():-2
>
>     javax.security.auth.Subject.doAs():422
>
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>
>     java.lang.Thread.run():745 (state=,code=0)
>