Posted to issues@drill.apache.org by "Matt Keranen (JIRA)" <ji...@apache.org> on 2016/03/10 20:38:40 UTC

[jira] [Comment Edited] (DRILL-4317) Exceptions on SELECT and CTAS with large CSV files

    [ https://issues.apache.org/jira/browse/DRILL-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189824#comment-15189824 ] 

Matt Keranen edited comment on DRILL-4317 at 3/10/16 7:37 PM:
--------------------------------------------------------------

Tested by running "split -l 100000 test.csv test_" on the file, adding the header row with column names to each subset, and importing the subsets as test_??; the exception does not appear to be triggered.

Perhaps the issue is not with the contents of the file but with the size of the data or the number of rows. In this test the source file was 1,432,857 lines (147 MB).
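For reference, the split-and-reheader step above can be sketched as a small shell script. This is a minimal demonstration on a tiny synthetic file with made-up column names; the original run used "split -l 100000" on the real test.csv.

```shell
#!/bin/sh
# Sketch of the split-and-reheader workflow: create a tiny sample CSV
# (hypothetical column names), split the data rows into fixed-size
# chunks, then prepend the header row to every chunk.
printf 'ts,key,val\n1,a,10\n2,b,20\n3,c,30\n' > test.csv

head -n 1 test.csv > header.csv            # save the column-name row
tail -n +2 test.csv | split -l 2 - test_   # split only the data rows (real run: -l 100000)

for f in test_a?; do                       # test_aa, test_ab, ...
    cat header.csv "$f" > "$f.csv" && rm "$f"   # re-attach the header to each chunk
done
```

Each resulting test_??.csv then starts with the same header line as the original file, so the chunks can be queried the same way as the full file.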


was (Author: mattk):
Tested by running "split -l 100000 test.csv test_" on the file, adding the header row with column names to each subset, and importing the subsets as test_??; the exception does not appear to be triggered.

Perhaps the issue is not with the contents of the file but with the size of the data or the number of rows.

> Exceptions on SELECT and CTAS with large CSV files
> --------------------------------------------------
>
>                 Key: DRILL-4317
>                 URL: https://issues.apache.org/jira/browse/DRILL-4317
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.4.0, 1.5.0
>         Environment: 4 node cluster, Hadoop 2.7.0, 14.04.1-Ubuntu
>            Reporter: Matt Keranen
>
> Selecting from a CSV file or running a CTAS into Parquet generates exceptions.
> The source file is ~650 MB: a table of 4 key columns followed by 39 numeric data columns, otherwise a fairly simple format. Example:
> {noformat}
> 2015-10-17 00:00,f5e9v8u2,err,fr7,226020793,76.094,26307,226020793,76.094,26307,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 2015-10-17 00:00,c3f9x5z2,err,mi1,1339159295,216.004,177690,1339159295,216.004,177690,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 2015-10-17 00:00,r5z2f2i9,err,mi1,7159994629,39718.011,65793,6142021303,30687.811,64630,143777403,40.521,146,75503742,41.905,89,170771174,168.165,198,192565529,370.475,222,97577280,318.068,120,62631452,288.253,68,32371173,189.527,39,41712265,299.184,46,39046408,363.418,47,34182318,465.343,43,127834582,6485.341,145
> 2015-10-17 00:00,j9s6i8t2,err,fr7,20580443899,277445.055,67826,2814893469,85447.816,54275,2584757097,608.001,2044,1395571268,769.113,1051,3070616988,3000.005,2284,3413811671,6489.060,2569,1772235156,5806.214,1339,1097879284,5064.120,858,691884865,4035.397,511,672967845,4815.875,518,789163614,7306.684,599,813910495,10632.464,627,1462752147,143470.306,1151
> {noformat}
> A "SELECT from `/path/to/file.csv`" runs for tens of minutes and eventually results in:
> {noformat}
> java.lang.IndexOutOfBoundsException: index: 547681, length: 1 (expected: range(0, 547681))
>         at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
>         at io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
>         at io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
>         at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
>         at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
>         at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
>         at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
>         at org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
>         at org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
>         at org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
>         at org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
>         at org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
>         at org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
>         at org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
>         at org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
>         at net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
>         at org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
>         at sqlline.Rows$Row.<init>(Rows.java:157)
>         at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
>         at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
>         at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
>         at sqlline.SqlLine.print(SqlLine.java:1593)
>         at sqlline.Commands.execute(Commands.java:852)
>         at sqlline.Commands.sql(Commands.java:751)
>         at sqlline.SqlLine.dispatch(SqlLine.java:746)
>         at sqlline.SqlLine.begin(SqlLine.java:621)
>         at sqlline.SqlLine.start(SqlLine.java:375)
>         at sqlline.SqlLine.main(SqlLine.java:268)
> {noformat}
> A CTAS on the same file with Parquet as the storage format results in:
> {noformat}
> Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)
> Fragment 1:2
> [Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]
>   (java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
>     io.netty.buffer.AbstractByteBuf.checkIndex():1131
>     io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
>     io.netty.buffer.WrappedByteBuf.nioBuffer():727
>     io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
>     io.netty.buffer.DrillBuf.nioBuffer():356
>     org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
>     org.apache.drill.exec.store.EventBasedRecordWriter.write():62
>     org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():415
>     org.apache.hadoop.security.UserGroupInformation.doAs():1657
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745 (state=,code=0)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)