Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/03 16:18:02 UTC

[GitHub] [orc] bilbingham opened a new pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

bilbingham opened a new pull request #649:
URL: https://github.com/apache/orc/pull/649


   -- Add a check for null readerSchema and use the fileSchema before calculating readerSchemaIsAcid && readerColumnOffset
   -- Resolves the "Include vector the wrong length" error for ACID files/tables that don't set OrcConf.MAPRED_INPUT_SCHEMA
   
   ### What changes were proposed in this pull request?
   Set readerSchema to the fileSchema prior to setting the final properties for the reader schema.
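   
   As a rough sketch, the intended fallback looks something like this (a minimal illustration only, not the exact patch; the ACID field names mirror the ones quoted later in this thread, and the helper names are assumptions):
   
   ```
   import java.util.Arrays;
   import java.util.List;
   import org.apache.orc.TypeDescription;
   
   final class ReaderSchemaFallback {
       // The wrapper fields of an ACID file schema.
       static final List<String> ACID_EVENT_FIELDS = Arrays.asList(
           "operation", "originalTransaction", "bucket",
           "rowId", "currentTransaction", "row");
   
       // Simplified stand-in for SchemaEvolution's ACID-schema check.
       static boolean looksLikeAcidSchema(TypeDescription schema) {
           return schema.getCategory() == TypeDescription.Category.STRUCT
               && schema.getFieldNames().containsAll(ACID_EVENT_FIELDS);
       }
   
       // Fall back to the file schema when no reader schema was supplied, so that
       // readerSchemaIsAcid and readerColumnOffset are derived from the schema
       // that will actually be read.
       static void deriveReaderProperties(TypeDescription readerSchema, TypeDescription fileSchema) {
           TypeDescription effective = readerSchema != null ? readerSchema : fileSchema;
           boolean readerSchemaIsAcid = looksLikeAcidSchema(effective);
           int readerColumnOffset = readerSchemaIsAcid ? ACID_EVENT_FIELDS.size() : 0;
           System.out.println("isAcid=" + readerSchemaIsAcid + " offset=" + readerColumnOffset);
       }
   }
   ```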
   
   ### Why are the changes needed?
   - Currently, constructing a SchemaEvolution from a record reader without OrcConf.MAPRED_INPUT_SCHEMA causes an "Include vector the wrong length" error.
   
   ### How was this patch tested?
   Limited testing on a local workstation.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] bilbingham edited a comment on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
bilbingham edited a comment on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-790733843


   To reproduce:
   The attached file was created with the Hive v2 streaming API into the following table (5000k random values):
   CREATE TABLE acidorc ( i int, j int, k int)
   STORED AS ORC
   tblproperties( "transactional"="true", "orc.compress"="SNAPPY", "orc.bloom.filter.columns"="i,j,k");
   
   [acid.orc.zip](https://github.com/apache/orc/files/6085038/acid.orc.zip)
   
   ```
   package orc.apache.orc.test;
   import com.google.gson.stream.JsonWriter;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.NullWritable;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapreduce.RecordReader;
   import org.apache.hadoop.mapreduce.TaskAttemptContext;
   import org.apache.hadoop.mapreduce.TaskAttemptID;
   import org.apache.hadoop.mapreduce.TaskType;
   import org.apache.hadoop.mapreduce.lib.input.FileSplit;
   import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
   import org.apache.hive.streaming.StreamingException;
   import org.apache.orc.OrcConf;
   import org.apache.orc.mapred.OrcStruct;
   import org.apache.orc.mapreduce.OrcInputFormat;
   
   import java.io.*;
   
   public class Break {
       public static void main(String[] args) throws StreamingException, IOException, InterruptedException {
   //        CREATE TABLE acidorc ( i int, j int, k int)
   //        STORED AS ORC
   //        tblproperties( "transactional"="true", "orc.compress"="SNAPPY", "orc.bloom.filter.columns"="i,j,k");
   
           Path workDir = new Path(System.getProperty("test.tmp.dir", "target" + File.separator + "test" + File.separator + "tmp"));
           JobConf conf = new JobConf();
           FileSystem fs;
   //        String typeStr = "struct<i:int,j:int,k:int>";
   //        OrcConf.MAPRED_OUTPUT_SCHEMA.setString(conf, typeStr);
   //        conf.set("mapreduce.output.fileoutputformat.outputdir", "workDir.toStworkDir.toString()ring()");
   //        conf.setInt(OrcConf.ROW_INDEX_STRIDE.getAttribute(), 1000);
   //        conf.setBoolean(OrcOutputFormat.SKIP_TEMP_DIRECTORY, true);
           TaskAttemptID id = new TaskAttemptID("jt", 0, TaskType.MAP, 0, 1);
           TaskAttemptContext attemptContext = new TaskAttemptContextImpl(conf, id);
   //        OutputFormat<NullWritable, OrcStruct> outputFormat =
   //                new OrcOutputFormat<OrcStruct>();
   //        RecordWriter<NullWritable, OrcStruct> writer = outputFormat.getRecordWriter(attemptContext);
   //
   //        // write 4000 rows with the integer and the binary string
   //        TypeDescription type = TypeDescription.fromString(typeStr);
   //        OrcStruct row = (OrcStruct) OrcStruct.createValue(type);
   //        NullWritable nada = NullWritable.get();
   //        for(int r=0; r < 3000; ++r) {
   //            row.setFieldValue(0, new IntWritable(r));
   //            row.setFieldValue(1, new IntWritable(r * 2));
   //            row.setFieldValue(2, new IntWritable(r * 3));
   //            writer.write(nada, row);
   //        }
   //        writer.close(attemptContext);
           conf.set(OrcConf.INCLUDE_COLUMNS.getAttribute(), "5");
           FileSplit split = new FileSplit(new Path("/tmp", "acid.orc"),
                   0, 1000000, new String[0]);
           RecordReader<NullWritable, OrcStruct> reader =
                   new OrcInputFormat<OrcStruct>().createRecordReader(split,
                           attemptContext);
        // without OrcConf.MAPRED_INPUT_SCHEMA set, reading this ACID file fails with "Include vector the wrong length"
           int count = 0;
           while (reader.nextKeyValue() && count < 5) {
               count++;
               OutputStream outputStream = new ByteArrayOutputStream();
               JsonWriter jw = new JsonWriter(new OutputStreamWriter(outputStream, "UTF-8"));
               OrcStruct row = (OrcStruct)reader.getCurrentValue();
               jw.beginObject();
               for (int i = 0; i < row.getNumFields(); i++) {
                   jw.name(row.getSchema().getFieldNames().get(i));
                   jw.value(String.valueOf(row.getFieldValue(i)));
               }
               jw.endObject();
               jw.close();
            System.out.println(outputStream.toString());
           }
       }
   }
   
   ```





[GitHub] [orc] omalley closed pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
omalley closed pull request #649:
URL: https://github.com/apache/orc/pull/649


   





[GitHub] [orc] pgaref edited a comment on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-789930186


   Hey @bilbingham, thanks for reporting this! Can you please help me reproduce it?
   It would also make sense to add a test case targeting this change, but that can be the next step :)





[GitHub] [orc] pgaref commented on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-789930186


   Hey @bilbingham, thanks for reporting this! Can you please add a code sample to reproduce it?
   It would also make sense to add a test case targeting this change, but that can be the next step :)





[GitHub] [orc] omalley commented on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-805274428


   This was resolved by f3cd500bbe.





[GitHub] [orc] pgaref commented on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-790052712


   Hey @bilbingham, I guess an [MR test like this](https://github.com/apache/orc/blob/master/java/mapreduce/src/test/org/apache/orc/mapreduce/TestMapreduceOrcOutputFormat.java#L116) could be used to repro this -- even better if you already have some ORC files.





[GitHub] [orc] pgaref commented on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-792722046


   Hey @bilbingham, thanks for the detailed example, this was super helpful.
   Indeed, it seems that the MR record reader cannot process ACID files when the schema is not set as part of the options.
   I do believe we should default to fileSchema when it's not set -- I created a PR following that logic as part of #650.
   
   Please let me know what you think!





[GitHub] [orc] bilbingham commented on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
bilbingham commented on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-789971351


   To reproduce (sorry, this is only a fragment; my test code is currently tied to a specific HDP cluster's Hive and conf variables, and I don't have a generic test case at the moment).
   
   With a Hive table based on the Hive API V2 streaming data ingest (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2):
   CREATE TABLE acidorc ( col1 string, col2 string, col3 string, col4 string, col5 string )
   PARTITIONED BY (part1 string)
   STORED AS ORC
   tblproperties( "transactional"="true", "orc.compress"="SNAPPY",  "orc.bloom.filter.columns"="col1,col2,col3");
   
   
   
    ```
    Path p = new Path("/path/to/acidorc/files");
    OrcInputFormat oif = new OrcInputFormat();
    JobConf jc = new JobConf();
    // Adding the schema to the conf resolves it as well:
    //OrcConf.MAPRED_INPUT_SCHEMA.setString(jc,"struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<col1:string,col2:string,col3:string,col4:string,col5:string>>");
    // Works fine until you try to INCLUDE_COLUMNS
    OrcConf.INCLUDE_COLUMNS.setString(jc, "5");
    Job theJob = new Job(jc);
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(theJob, p);
    List<org.apache.hadoop.mapreduce.InputSplit> splits = oif.getSplits(theJob);
    splits.forEach(split -> {
                try {
                    // tac is a TaskAttemptContext built elsewhere (see the sketch after this block)
                    org.apache.hadoop.mapreduce.RecordReader rr = oif.createRecordReader(split, tac);
                    rr.initialize(split, tac);
                    while (rr.nextKeyValue()) {
                        OutputStream outputStream = new ByteArrayOutputStream();
                        JsonWriter jw = new JsonWriter(new OutputStreamWriter(outputStream, "UTF-8"));
                        OrcStruct row = (OrcStruct) (((OrcStruct) rr.getCurrentValue()).getFieldValue(5));
                        jw.beginObject();
                        for (int i = 0; i < row.getNumFields(); i++) {
                            jw.name(row.getSchema().getFieldNames().get(i));
                            jw.value(String.valueOf(row.getFieldValue(i)));
                        }
                        jw.endObject();
                        jw.close();
                    }
                } catch (Exception ex) {
                    System.out.println("Bummer " + ex.getMessage());
                }
    });
    ```
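    
    The `tac` above is a TaskAttemptContext assumed to be built elsewhere; a minimal stand-in (sketch only; the helper name is hypothetical) could look like:
    
    ```
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.TaskAttemptID;
    import org.apache.hadoop.mapreduce.TaskType;
    import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
    
    class AttemptContexts {
        // Hypothetical helper: builds the TaskAttemptContext ("tac") used in the fragment above.
        static TaskAttemptContext newContext(JobConf jc) {
            return new TaskAttemptContextImpl(jc, new TaskAttemptID("jt", 0, TaskType.MAP, 0, 0));
        }
    }
    ```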





[GitHub] [orc] pgaref edited a comment on pull request #649: [ORC-756] Check null readerschema prior to setting readerSchemaIsAcid && readerColumnOffset

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #649:
URL: https://github.com/apache/orc/pull/649#issuecomment-790052712


   Hey @bilbingham, I guess an [MR test like this](https://github.com/apache/orc/blob/master/java/mapreduce/src/test/org/apache/orc/mapreduce/TestMapreduceOrcOutputFormat.java#L116) could be used to repro the above scenario -- even better if you already have some ORC files.




