Posted to user@hive.apache.org by Peter Marron <Pe...@trilliumsoftware.com> on 2013/05/15 12:38:23 UTC

Filtering

Hi,

I'm using Hive 0.10.0 and Hadoop 1.0.4.

I would like to create a normal table but have some of my own code run so that I can remove the filtering
parts of the query and limit the output in the splits of the InputFormat. I believe that this is
"filter pushdown" as described in https://cwiki.apache.org/Hive/filterpushdowndev.html
I have tried various approaches and run into problems, and I was wondering
if anyone had any suggestions as to how I might proceed.

Firstly, although that page mentions InputFormat, there doesn't seem to be any way (that I can find)
to pass filters to InputFormats, so I gave up on that approach.

I see that in the method pushFilterToStorageHandler, around line 776 of OpProcFactory.java,
there is a call to the decomposePredicate method of the storage handler if it's an instance of HiveStoragePredicateHandler.
This looks exactly like what I want, so my first approach is to provide my own StorageHandler.
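
Roughly, the kind of storage handler I have in mind looks like the sketch below. This is not
working code - I'm going from my reading of the 0.10 interfaces, and the class name just matches
the one I use in the CREATE TABLE below - but it shows the decomposePredicate hook I want to use:

import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.mapred.JobConf;

// Illustrative only: a storage handler that claims the whole predicate so that
// its InputFormat can do the filtering (e.g. in getSplits).
public class StorageWindow extends DefaultStorageHandler
    implements HiveStoragePredicateHandler {

  @Override
  public DecomposedPredicate decomposePredicate(JobConf jobConf,
                                                Deserializer deserializer,
                                                ExprNodeDesc predicate) {
    DecomposedPredicate result = new DecomposedPredicate();
    result.pushedPredicate = predicate;  // push everything down
    result.residualPredicate = null;     // leave nothing for Hive to re-check
    return result;
  }
}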

However, when I create a table with my own custom storage handler I get this behaviour:

                > DROP TABLE ordinals2;
                OK
                Time taken: 0.136 seconds
                hive> CREATE TABLE ordinals2 (english STRING, number INT, italian STRING)
                >  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                >  STORED BY 'StorageWindow';
                getInputFormatClass
                getOutputFormatClass
                OK
                Time taken: 0.075 seconds
                hive> LOAD DATA LOCAL INPATH 'ordinals.txt' OVERWRITE INTO TABLE ordinals2;
                FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
                hive>

However, I don't care whether the table is "non-native" or not; I just want to use it in the same
way that I use a normal table. Is there some way that I can get the normal behaviour
and still use a custom storage handler?

OK, given that the StorageHandler approach doesn't seem to be working out for me, I tried
the next obvious approach - indexes. So I created my own index class called 'DummyIndex'.
Because I wanted to keep this index as simple as possible I had it return 'false' in the
method usesIndexTable. However, when I then try to create an index of this type I get
this error.

                > CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
                >  AS 'DummyIndex' WITH DEFERRED REBUILD;
                FAILED: Error in metadata: java.lang.NullPointerException
                FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
                hive>

A little investigation suggests that this is failing in method add_Index
lines 2796-2822 of file HiveMetaStore.java. Here we can see that the
argument indexTable is de-referenced on line 2798 for what looks like a tracing
statement. This gives the NPE if the argument is null.

If we look at method createIndex lines 614-759 of file Hive.java
we can see that the variable tt is initialised to null on line 722 and passed
to the CreateIndex method on line 754. But it can only change to a non-null
value if the indexHandler returns true in the usesIndexTable method on line 725.

The bottom line seems to be that the code can't work if you have an index
that returns false for usesIndexTable. Isn't this a bug? Do I need to raise a JIRA?

If I change my index so that it returns true for usesIndexTable, but still try to keep
all the other methods minimal, I get this error when I try to create an index.

                > CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
                >  AS 'DummyIndex' WITH DEFERRED REBUILD;
                FAILED: Error in metadata: at least one column must be specified for the table
                FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So, at the moment, it looks like I'm going to have my index class derive from one of the reference
implementations (org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) in order to get it to work at all.

Is it worth trying to build Hive from source so that I can hack out the tracing that causes the NPE?
Or is it likely to start failing somewhere else?

Any comments welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: Peter.Marron@TrilliumSoftware.com


Re: OrcFile writing failing with multiple threads

Posted by Andrew Psaltis <An...@Webtrends.com>.
Yes, I am using the orc.Writer interface directly. It still fails unless I synchronize the call to addRow.

In good health,
Andrew

Sent from my GPU powered iPhone

On Jun 11, 2013, at 11:00, "Owen O'Malley" <om...@apache.org> wrote:

Sorry for dropping this thread for so long. Are you using the orc.Writer interface directly or going through the OrcOutputFormat? I forgot that I had updated the Writer to synchronize things. It looks like all of the methods in Writer are synchronized. I haven't had a chance to investigate this further yet.

-- Owen


On Fri, May 24, 2013 at 1:28 PM, Andrew Psaltis <An...@webtrends.com> wrote:
Here is a snippet from the file header comment of the WriterImpl for ORC:

/**
 …………
 * This class is synchronized so that multi-threaded access is ok. In
 * particular, because the MemoryManager is shared between writers, this class
 * assumes that checkMemory may be called from a separate thread.
 */

And then the addRow looks like this:

 public void addRow(Object row) throws IOException {
    synchronized (this) {
      treeWriter.write(row);
      rowsInStripe += 1;
      if (buildIndex) {
        rowsInIndex += 1;

        if (rowsInIndex >= rowIndexStride) {
          createRowIndexEntry();
        }
      }
    }
    memoryManager.addedRow();
  }

Am I missing something here about the synchronized(this) ?  Perhaps I am looking in the wrong place.

Thanks,
agp


From: Owen O'Malley <om...@apache.org>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Friday, May 24, 2013 2:15 PM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: OrcFile writing failing with multiple threads

Currently, ORC writers, like the Java collections API, don't lock themselves. You should synchronize on the writer before adding a row. I'm open to making the writers synchronized.

-- Owen


On Fri, May 24, 2013 at 11:39 AM, Andrew Psaltis <An...@webtrends.com> wrote:
All,
I have a test application that is attempting to add rows to an OrcFile from multiple threads; however, every time I do, I get exceptions with stack traces like the following:

java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
at org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)


Below is the source code for my sample app, which is heavily based on the TestOrcFile test case using BigRow. Is there something I am doing wrong here, or is this a legitimate bug in the ORC writer?

Thanks in advance,
Andrew


------------------------- Java app code follows ---------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class OrcFileTester {

    private Writer writer;
    private LinkedBlockingQueue<BigRow> bigRowQueue = new LinkedBlockingQueue<BigRow>();
    public OrcFileTester(){

      try{
        Path workDir = new Path(System.getProperty("test.tmp.dir",
                "target" + File.separator + "test" + File.separator + "tmp"));

        Configuration conf;
        FileSystem fs;
        Path testFilePath;

        conf = new Configuration();
        fs = FileSystem.getLocal(conf);
        testFilePath = new Path(workDir, "TestOrcFile.OrcFileTester.orc");
        fs.delete(testFilePath, false);


        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector
                (BigRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        writer = OrcFile.createWriter(fs, testFilePath, conf, inspector,
                100000, CompressionKind.ZLIB, 10000, 10000);

        final ExecutorService bigRowWorkerPool = Executors.newFixedThreadPool(10);

        //Changing this to more than 1 causes exceptions when writing rows.
        for (int i = 0; i < 1; i++) {
            bigRowWorkerPool.submit(new BigRowWriter());
        }
          for(int i =0; i < 100; i++){
              if(0 == i % 2){
                 bigRowQueue.put(new BigRow(false, (byte) 1, (short) 1024, 65536,
                         Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              } else{
                   bigRowQueue.put(new BigRow(false, null, (short) 1024, 65536,
                           Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              }
          }

          while (!bigRowQueue.isEmpty()) {
              Thread.sleep(2000);
          }
          bigRowWorkerPool.shutdownNow();
      }catch(Exception ex){
          ex.printStackTrace();
      }
    }
    public void WriteBigRow(){

    }

    private static Map<Text, Text> map(String... items)  {
        Map<Text, Text> result = new HashMap<Text, Text>();
        for(String i: items) {
            result.put(new Text(i), new Text(i));
        }
        return result;
    }
    private static BytesWritable bytes(int... items) {
        BytesWritable result = new BytesWritable();
        result.setSize(items.length);
        for(int i=0; i < items.length; ++i) {
            result.getBytes()[i] = (byte) items[i];
        }
        return result;
    }

    public static class BigRow {
        Boolean boolean1;
        Byte byte1;
        Short short1;
        Integer int1;
        Long long1;
        Float float1;
        Double double1;
        BytesWritable bytes1;
        Text string1;
        Map<Text, Text> map = new HashMap<Text, Text>();

        BigRow(Boolean b1, Byte b2, Short s1, Integer i1, Long l1, Float f1,
               Double d1,
               BytesWritable b3, String s2, Map<Text, Text> m2) {
            this.boolean1 = b1;
            this.byte1 = b2;
            this.short1 = s1;
            this.int1 = i1;
            this.long1 = l1;
            this.float1 = f1;
            this.double1 = d1;
            this.bytes1 = b3;
            if (s2 == null) {
                this.string1 = null;
            } else {
                this.string1 = new Text(s2);
            }
            this.map = m2;
        }
    }



    class BigRowWriter implements Runnable{

        @Override
        public void run() {
                try {
                    BigRow bigRow = bigRowQueue.take();
                    writer.addRow(bigRow);
                } catch (Exception e) {
                    e.printStackTrace();
                }

        }
    }

    public static void main(String[] args) throws IOException {
        OrcFileTester  orcFileTester = new OrcFileTester();
        orcFileTester.WriteBigRow();
    }



}

-----------------------------end of Java source ------------------------------

----------------------------- pom file start ----------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>ORCTester</groupId>
    <artifactId>ORCTester</artifactId>
    <version>1.0-SNAPSHOT</version>
  <dependencies>
      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>0.11.0</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.2</version>
      </dependency>
  </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.1.1</version>
                <executions>
                    <execution>
                        <phase>test</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>OrcFileTester</mainClass>
                            <arguments/>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
----------------------------- pom file end ----------------------------------



Re: OrcFile writing failing with multiple threads

Posted by Owen O'Malley <om...@apache.org>.
Sorry for dropping this thread for so long. Are you using the orc.Writer
interface directly or going through the OrcOutputFormat? I forgot that I
had updated the Writer to synchronize things. It looks like all of the
methods in Writer are synchronized. I haven't had a chance to investigate
this further yet.

-- Owen


On Fri, May 24, 2013 at 1:28 PM, Andrew Psaltis <
Andrew.Psaltis@webtrends.com> wrote:

>  Here is a snippet from the file header comment of the WriterImpl for ORC:
>
>  /**
>  …………
>  * This class is synchronized so that multi-threaded access is ok. In
>  * particular, because the MemoryManager is shared between writers, this
> class
>  * assumes that checkMemory may be called from a separate thread.
>  */
>
>  And then the addRow looks like this:
>
>   public void addRow(Object row) throws IOException {
>     synchronized (this) {
>       treeWriter.write(row);
>       rowsInStripe += 1;
>       if (buildIndex) {
>         rowsInIndex += 1;
>
>          if (rowsInIndex >= rowIndexStride) {
>           createRowIndexEntry();
>         }
>       }
>     }
>     memoryManager.addedRow();
>   }
>
>  Am I missing something here about the synchronized(this) ?  Perhaps I am
> looking in the wrong place.
>
>  Thanks,
> agp
>
>
>   From: Owen O'Malley <om...@apache.org>
> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
> Date: Friday, May 24, 2013 2:15 PM
> To: "user@hive.apache.org" <us...@hive.apache.org>
> Subject: Re: OrcFile writing failing with multiple threads
>
>   Currently, ORC writers, like the Java collections API, don't lock
> themselves. You should synchronize on the writer before adding a row. I'm
> open to making the writers synchronized.
>
>  -- Owen
>
>
> On Fri, May 24, 2013 at 11:39 AM, Andrew Psaltis <
> Andrew.Psaltis@webtrends.com> wrote:
>
>>  All,
>> I have a test application that is attempting to add rows to an OrcFile
>> from multiple threads, however, every time I do I get exceptions with stack
>> traces like the following:
>>
>>  java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
>> at
>> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
>> at
>> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
>> at
>> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
>> at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:722)
>>
>>
>>  Below is the source code for my sample app that is heavily based on the
>> TestOrcFile test case using BigRow. Is there something I am doing wrong
>> here, or is this a legitimate bug in the Orc writing?
>>
>>  Thanks in advance,
>> Andrew
>>
>>
>>  ------------------------- Java app code follows
>> ---------------------------------
>>  import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
>> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
>> import org.apache.hadoop.hive.ql.io.orc.Writer;
>> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>> import
>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
>> import org.apache.hadoop.io.BytesWritable;
>> import org.apache.hadoop.io.Text;
>>
>>  import java.io.File;
>> import java.io.IOException;
>> import java.util.HashMap;
>> import java.util.Map;
>> import java.util.concurrent.ExecutorService;
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.LinkedBlockingQueue;
>>
>>  public class OrcFileTester {
>>
>>      private Writer writer;
>>     private LinkedBlockingQueue<BigRow> bigRowQueue = new
>> LinkedBlockingQueue<BigRow>();
>>     public OrcFileTester(){
>>
>>        try{
>>         Path workDir = new Path(System.getProperty("test.tmp.dir",
>>                 "target" + File.separator + "test" + File.separator +
>> "tmp"));
>>
>>          Configuration conf;
>>         FileSystem fs;
>>         Path testFilePath;
>>
>>          conf = new Configuration();
>>         fs = FileSystem.getLocal(conf);
>>         testFilePath = new Path(workDir, "TestOrcFile.OrcFileTester.orc");
>>         fs.delete(testFilePath, false);
>>
>>
>>          ObjectInspector inspector =
>> ObjectInspectorFactory.getReflectionObjectInspector
>>                 (BigRow.class,
>> ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
>>         writer = OrcFile.createWriter(fs, testFilePath, conf, inspector,
>>                 100000, CompressionKind.ZLIB, 10000, 10000);
>>
>>          final ExecutorService bigRowWorkerPool =
>> Executors.newFixedThreadPool(10);
>>
>>          //Changing this to more than 1 causes exceptions when writing
>> rows.
>>         for (int i = 0; i < 1; i++) {
>>             bigRowWorkerPool.submit(new BigRowWriter());
>>         }
>>           for(int i =0; i < 100; i++){
>>               if(0 == i % 2){
>>                  bigRowQueue.put(new BigRow(false, (byte) 1, (short)
>> 1024, 65536,
>>                          Long.MAX_VALUE, (float) 1.0, -15.0,
>> bytes(0,1,2,3,4), "hi",map("hey","orc")));
>>               } else{
>>                    bigRowQueue.put(new BigRow(false, null, (short) 1024,
>> 65536,
>>                            Long.MAX_VALUE, (float) 1.0, -15.0,
>> bytes(0,1,2,3,4), "hi",map("hey","orc")));
>>               }
>>           }
>>
>>            while (!bigRowQueue.isEmpty()) {
>>               Thread.sleep(2000);
>>           }
>>           bigRowWorkerPool.shutdownNow();
>>       }catch(Exception ex){
>>           ex.printStackTrace();
>>       }
>>     }
>>     public void WriteBigRow(){
>>
>>      }
>>
>>      private static Map<Text, Text> map(String... items)  {
>>         Map<Text, Text> result = new HashMap<Text, Text>();
>>         for(String i: items) {
>>             result.put(new Text(i), new Text(i));
>>         }
>>         return result;
>>     }
>>     private static BytesWritable bytes(int... items) {
>>         BytesWritable result = new BytesWritable();
>>         result.setSize(items.length);
>>         for(int i=0; i < items.length; ++i) {
>>             result.getBytes()[i] = (byte) items[i];
>>         }
>>         return result;
>>     }
>>
>>      public static class BigRow {
>>         Boolean boolean1;
>>         Byte byte1;
>>         Short short1;
>>         Integer int1;
>>         Long long1;
>>         Float float1;
>>         Double double1;
>>         BytesWritable bytes1;
>>         Text string1;
>>         Map<Text, Text> map = new HashMap<Text, Text>();
>>
>>          BigRow(Boolean b1, Byte b2, Short s1, Integer i1, Long l1,
>> Float f1,
>>                Double d1,
>>                BytesWritable b3, String s2, Map<Text, Text> m2) {
>>             this.boolean1 = b1;
>>             this.byte1 = b2;
>>             this.short1 = s1;
>>             this.int1 = i1;
>>             this.long1 = l1;
>>             this.float1 = f1;
>>             this.double1 = d1;
>>             this.bytes1 = b3;
>>             if (s2 == null) {
>>                 this.string1 = null;
>>             } else {
>>                 this.string1 = new Text(s2);
>>             }
>>             this.map = m2;
>>         }
>>     }
>>
>>
>>
>>      class BigRowWriter implements Runnable{
>>
>>          @Override
>>         public void run() {
>>                 try {
>>                     BigRow bigRow = bigRowQueue.take();
>>                     writer.addRow(bigRow);
>>                 } catch (Exception e) {
>>                     e.printStackTrace();
>>                 }
>>
>>          }
>>     }
>>
>>      public static void main(String[] args) throws IOException {
>>         OrcFileTester  orcFileTester = new OrcFileTester();
>>         orcFileTester.WriteBigRow();
>>     }
>>
>>
>>
>>  }
>>
>>  -----------------------------end of Java source
>> ------------------------------
>>
>>  ----------------------------- pom file start
>> ----------------------------------
>>  <?xml version="1.0" encoding="UTF-8"?>
>> <project xmlns="http://maven.apache.org/POM/4.0.0"
>>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>     <modelVersion>4.0.0</modelVersion>
>>
>>      <groupId>ORCTester</groupId>
>>     <artifactId>ORCTester</artifactId>
>>     <version>1.0-SNAPSHOT</version>
>>   <dependencies>
>>       <dependency>
>>           <groupId>org.apache.hive</groupId>
>>           <artifactId>hive-exec</artifactId>
>>           <version>0.11.0</version>
>>       </dependency>
>>       <dependency>
>>           <groupId>org.apache.hadoop</groupId>
>>           <artifactId>hadoop-core</artifactId>
>>           <version>0.20.2</version>
>>       </dependency>
>>   </dependencies>
>>     <build>
>>         <plugins>
>>             <plugin>
>>                 <groupId>org.codehaus.mojo</groupId>
>>                 <artifactId>exec-maven-plugin</artifactId>
>>                 <version>1.1.1</version>
>>                 <executions>
>>                     <execution>
>>                         <phase>test</phase>
>>                         <goals>
>>                             <goal>java</goal>
>>                         </goals>
>>                         <configuration>
>>                             <mainClass>OrcFileTester</mainClass>
>>                             <arguments/>
>>                         </configuration>
>>                     </execution>
>>                 </executions>
>>             </plugin>
>>         </plugins>
>>     </build>
>> </project>
>>    ----------------------------- pom file end
>> ----------------------------------
>>
>
>

Re: OrcFile writing failing with multiple threads

Posted by Andrew Psaltis <An...@Webtrends.com>.
Here is a snippet from the file header comment of the WriterImpl for ORC:

/**
 …………
 * This class is synchronized so that multi-threaded access is ok. In
 * particular, because the MemoryManager is shared between writers, this class
 * assumes that checkMemory may be called from a separate thread.
 */

And then the addRow looks like this:

 public void addRow(Object row) throws IOException {
    synchronized (this) {
      treeWriter.write(row);
      rowsInStripe += 1;
      if (buildIndex) {
        rowsInIndex += 1;

        if (rowsInIndex >= rowIndexStride) {
          createRowIndexEntry();
        }
      }
    }
    memoryManager.addedRow();
  }

Am I missing something here about the synchronized(this) ?  Perhaps I am looking in the wrong place.

Thanks,
agp


From: Owen O'Malley <om...@apache.org>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Friday, May 24, 2013 2:15 PM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: OrcFile writing failing with multiple threads

Currently, ORC writers, like the Java collections API, don't lock themselves. You should synchronize on the writer before adding a row. I'm open to making the writers synchronized.

-- Owen


On Fri, May 24, 2013 at 11:39 AM, Andrew Psaltis <An...@webtrends.com> wrote:
All,
I have a test application that is attempting to add rows to an OrcFile from multiple threads; however, every time I do, I get exceptions with stack traces like the following:

java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
at org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)


Below is the source code for my sample app, which is heavily based on the TestOrcFile test case using BigRow. Is there something I am doing wrong here, or is this a legitimate bug in the ORC writer?

Thanks in advance,
Andrew


------------------------- Java app code follows ---------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class OrcFileTester {

    private Writer writer;
    private LinkedBlockingQueue<BigRow> bigRowQueue = new LinkedBlockingQueue<BigRow>();
    public OrcFileTester(){

      try{
        Path workDir = new Path(System.getProperty("test.tmp.dir",
                "target" + File.separator + "test" + File.separator + "tmp"));

        Configuration conf;
        FileSystem fs;
        Path testFilePath;

        conf = new Configuration();
        fs = FileSystem.getLocal(conf);
        testFilePath = new Path(workDir, "TestOrcFile.OrcFileTester.orc");
        fs.delete(testFilePath, false);


        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector
                (BigRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        writer = OrcFile.createWriter(fs, testFilePath, conf, inspector,
                100000, CompressionKind.ZLIB, 10000, 10000);

        final ExecutorService bigRowWorkerPool = Executors.newFixedThreadPool(10);

        //Changing this to more than 1 causes exceptions when writing rows.
        for (int i = 0; i < 1; i++) {
            bigRowWorkerPool.submit(new BigRowWriter());
        }
          for(int i =0; i < 100; i++){
              if(0 == i % 2){
                 bigRowQueue.put(new BigRow(false, (byte) 1, (short) 1024, 65536,
                         Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              } else{
                   bigRowQueue.put(new BigRow(false, null, (short) 1024, 65536,
                           Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              }
          }

          while (!bigRowQueue.isEmpty()) {
              Thread.sleep(2000);
          }
          bigRowWorkerPool.shutdownNow();
      }catch(Exception ex){
          ex.printStackTrace();
      }
    }
    public void WriteBigRow(){

    }

    private static Map<Text, Text> map(String... items)  {
        Map<Text, Text> result = new HashMap<Text, Text>();
        for(String i: items) {
            result.put(new Text(i), new Text(i));
        }
        return result;
    }
    private static BytesWritable bytes(int... items) {
        BytesWritable result = new BytesWritable();
        result.setSize(items.length);
        for(int i=0; i < items.length; ++i) {
            result.getBytes()[i] = (byte) items[i];
        }
        return result;
    }

    public static class BigRow {
        Boolean boolean1;
        Byte byte1;
        Short short1;
        Integer int1;
        Long long1;
        Float float1;
        Double double1;
        BytesWritable bytes1;
        Text string1;
        Map<Text, Text> map = new HashMap<Text, Text>();

        BigRow(Boolean b1, Byte b2, Short s1, Integer i1, Long l1, Float f1,
               Double d1,
               BytesWritable b3, String s2, Map<Text, Text> m2) {
            this.boolean1 = b1;
            this.byte1 = b2;
            this.short1 = s1;
            this.int1 = i1;
            this.long1 = l1;
            this.float1 = f1;
            this.double1 = d1;
            this.bytes1 = b3;
            if (s2 == null) {
                this.string1 = null;
            } else {
                this.string1 = new Text(s2);
            }
            this.map = m2;
        }
    }



    class BigRowWriter implements Runnable{

        @Override
        public void run() {
                try {
                    BigRow bigRow = bigRowQueue.take();
                    writer.addRow(bigRow);
                } catch (Exception e) {
                    e.printStackTrace();
                }

        }
    }

    public static void main(String[] args) throws IOException {
        OrcFileTester  orcFileTester = new OrcFileTester();
        orcFileTester.WriteBigRow();
    }



}

-----------------------------end of Java source ------------------------------

----------------------------- pom file start ----------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>ORCTester</groupId>
    <artifactId>ORCTester</artifactId>
    <version>1.0-SNAPSHOT</version>
  <dependencies>
      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>0.11.0</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.2</version>
      </dependency>
  </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.1.1</version>
                <executions>
                    <execution>
                        <phase>test</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>OrcFileTester</mainClass>
                            <arguments/>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
----------------------------- pom file end ----------------------------------


Re: OrcFile writing failing with multiple threads

Posted by Owen O'Malley <om...@apache.org>.
Currently, ORC writers, like the Java collections API, don't lock
themselves. You should synchronize on the writer before adding a row. I'm
open to making the writers synchronized.
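
For example, something like this around each add (just a fragment, not a full program; "writer"
is your shared org.apache.hadoop.hive.ql.io.orc.Writer and "row" is whatever you are appending):

// Take the writer's monitor so that concurrent addRow calls cannot
// interleave inside the writer's internal state.
synchronized (writer) {
  writer.addRow(row);
}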

-- Owen


On Fri, May 24, 2013 at 11:39 AM, Andrew Psaltis <
Andrew.Psaltis@webtrends.com> wrote:

>  All,
> I have a test application that is attempting to add rows to an OrcFile
> from multiple threads, however, every time I do I get exceptions with stack
> traces like the following:
>
>  java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
> at
> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
> at
> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
> at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
> at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
> at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
> at
> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
> at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
>
>
>  Below is the source code for my sample app that is heavily based on the
> TestOrcFile test case using BigRow. Is there something I am doing wrong
> here, or is this a legitimate bug in the Orc writing?
>
>  Thanks in advance,
> Andrew
>
>
>  ------------------------- Java app code follows
> ---------------------------------
>  import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
> import org.apache.hadoop.hive.ql.io.orc.Writer;
> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
> import
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.Text;
>
>  import java.io.File;
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.LinkedBlockingQueue;
>
>  public class OrcFileTester {
>
>      private Writer writer;
>     private LinkedBlockingQueue<BigRow> bigRowQueue = new
> LinkedBlockingQueue<BigRow>();
>     public OrcFileTester(){
>
>        try{
>         Path workDir = new Path(System.getProperty("test.tmp.dir",
>                 "target" + File.separator + "test" + File.separator +
> "tmp"));
>
>          Configuration conf;
>         FileSystem fs;
>         Path testFilePath;
>
>          conf = new Configuration();
>         fs = FileSystem.getLocal(conf);
>         testFilePath = new Path(workDir, "TestOrcFile.OrcFileTester.orc");
>         fs.delete(testFilePath, false);
>
>
>          ObjectInspector inspector =
> ObjectInspectorFactory.getReflectionObjectInspector
>                 (BigRow.class,
> ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
>         writer = OrcFile.createWriter(fs, testFilePath, conf, inspector,
>                 100000, CompressionKind.ZLIB, 10000, 10000);
>
>          final ExecutorService bigRowWorkerPool =
> Executors.newFixedThreadPool(10);
>
>          //Changing this to more than 1 causes exceptions when writing
> rows.
>         for (int i = 0; i < 1; i++) {
>             bigRowWorkerPool.submit(new BigRowWriter());
>         }
>           for(int i =0; i < 100; i++){
>               if(0 == i % 2){
>                  bigRowQueue.put(new BigRow(false, (byte) 1, (short) 1024,
> 65536,
>                          Long.MAX_VALUE, (float) 1.0, -15.0,
> bytes(0,1,2,3,4), "hi",map("hey","orc")));
>               } else{
>                    bigRowQueue.put(new BigRow(false, null, (short) 1024,
> 65536,
>                            Long.MAX_VALUE, (float) 1.0, -15.0,
> bytes(0,1,2,3,4), "hi",map("hey","orc")));
>               }
>           }
>
>            while (!bigRowQueue.isEmpty()) {
>               Thread.sleep(2000);
>           }
>           bigRowWorkerPool.shutdownNow();
>       }catch(Exception ex){
>           ex.printStackTrace();
>       }
>     }
>     public void WriteBigRow(){
>
>      }
>
>      private static Map<Text, Text> map(String... items)  {
>         Map<Text, Text> result = new HashMap<Text, Text>();
>         for(String i: items) {
>             result.put(new Text(i), new Text(i));
>         }
>         return result;
>     }
>     private static BytesWritable bytes(int... items) {
>         BytesWritable result = new BytesWritable();
>         result.setSize(items.length);
>         for(int i=0; i < items.length; ++i) {
>             result.getBytes()[i] = (byte) items[i];
>         }
>         return result;
>     }
>
>      public static class BigRow {
>         Boolean boolean1;
>         Byte byte1;
>         Short short1;
>         Integer int1;
>         Long long1;
>         Float float1;
>         Double double1;
>         BytesWritable bytes1;
>         Text string1;
>         Map<Text, Text> map = new HashMap<Text, Text>();
>
>          BigRow(Boolean b1, Byte b2, Short s1, Integer i1, Long l1, Float
> f1,
>                Double d1,
>                BytesWritable b3, String s2, Map<Text, Text> m2) {
>             this.boolean1 = b1;
>             this.byte1 = b2;
>             this.short1 = s1;
>             this.int1 = i1;
>             this.long1 = l1;
>             this.float1 = f1;
>             this.double1 = d1;
>             this.bytes1 = b3;
>             if (s2 == null) {
>                 this.string1 = null;
>             } else {
>                 this.string1 = new Text(s2);
>             }
>             this.map = m2;
>         }
>     }
>
>
>
>      class BigRowWriter implements Runnable{
>
>          @Override
>         public void run() {
>                 try {
>                     BigRow bigRow = bigRowQueue.take();
>                     writer.addRow(bigRow);
>                 } catch (Exception e) {
>                     e.printStackTrace();
>                 }
>
>          }
>     }
>
>      public static void main(String[] args) throws IOException {
>         OrcFileTester  orcFileTester = new OrcFileTester();
>         orcFileTester.WriteBigRow();
>     }
>
>
>
>  }
>
>  -----------------------------end of Java source
> ------------------------------
>
>  ----------------------------- pom file start
> ----------------------------------
>  <?xml version="1.0" encoding="UTF-8"?>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>     <modelVersion>4.0.0</modelVersion>
>
>      <groupId>ORCTester</groupId>
>     <artifactId>ORCTester</artifactId>
>     <version>1.0-SNAPSHOT</version>
>   <dependencies>
>       <dependency>
>           <groupId>org.apache.hive</groupId>
>           <artifactId>hive-exec</artifactId>
>           <version>0.11.0</version>
>       </dependency>
>       <dependency>
>           <groupId>org.apache.hadoop</groupId>
>           <artifactId>hadoop-core</artifactId>
>           <version>0.20.2</version>
>       </dependency>
>   </dependencies>
>     <build>
>         <plugins>
>             <plugin>
>                 <groupId>org.codehaus.mojo</groupId>
>                 <artifactId>exec-maven-plugin</artifactId>
>                 <version>1.1.1</version>
>                 <executions>
>                     <execution>
>                         <phase>test</phase>
>                         <goals>
>                             <goal>java</goal>
>                         </goals>
>                         <configuration>
>                             <mainClass>OrcFileTester</mainClass>
>                             <arguments/>
>                         </configuration>
>                     </execution>
>                 </executions>
>             </plugin>
>         </plugins>
>     </build>
> </project>
>    ----------------------------- pom file end
> ----------------------------------
>

OrcFile writing failing with multiple threads

Posted by Andrew Psaltis <An...@Webtrends.com>.
All,
I have a test application that is attempting to add rows to an OrcFile from multiple threads; however, every time I do, I get exceptions with stack traces like the following:

java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
at org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
at org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)


Below is the source code for my sample app, which is heavily based on the TestOrcFile test case using BigRow. Is there something I am doing wrong here, or is this a legitimate bug in the ORC writer?

Thanks in advance,
Andrew


------------------------- Java app code follows ---------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class OrcFileTester {

    private Writer writer;
    private LinkedBlockingQueue<BigRow> bigRowQueue = new LinkedBlockingQueue<BigRow>();
    public OrcFileTester(){

      try{
        Path workDir = new Path(System.getProperty("test.tmp.dir",
                "target" + File.separator + "test" + File.separator + "tmp"));

        Configuration conf;
        FileSystem fs;
        Path testFilePath;

        conf = new Configuration();
        fs = FileSystem.getLocal(conf);
        testFilePath = new Path(workDir, "TestOrcFile.OrcFileTester.orc");
        fs.delete(testFilePath, false);


        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector
                (BigRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        writer = OrcFile.createWriter(fs, testFilePath, conf, inspector,
                100000, CompressionKind.ZLIB, 10000, 10000);

        final ExecutorService bigRowWorkerPool = Executors.newFixedThreadPool(10);

        //Changing this to more than 1 causes exceptions when writing rows.
        for (int i = 0; i < 1; i++) {
            bigRowWorkerPool.submit(new BigRowWriter());
        }
          for(int i =0; i < 100; i++){
              if(0 == i % 2){
                 bigRowQueue.put(new BigRow(false, (byte) 1, (short) 1024, 65536,
                         Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              } else{
                   bigRowQueue.put(new BigRow(false, null, (short) 1024, 65536,
                           Long.MAX_VALUE, (float) 1.0, -15.0, bytes(0,1,2,3,4), "hi",map("hey","orc")));
              }
          }

          while (!bigRowQueue.isEmpty()) {
              Thread.sleep(2000);
          }
          bigRowWorkerPool.shutdownNow();
      }catch(Exception ex){
          ex.printStackTrace();
      }
    }
    public void WriteBigRow(){

    }

    private static Map<Text, Text> map(String... items)  {
        Map<Text, Text> result = new HashMap<Text, Text>();
        for(String i: items) {
            result.put(new Text(i), new Text(i));
        }
        return result;
    }
    private static BytesWritable bytes(int... items) {
        BytesWritable result = new BytesWritable();
        result.setSize(items.length);
        for(int i=0; i < items.length; ++i) {
            result.getBytes()[i] = (byte) items[i];
        }
        return result;
    }

    public static class BigRow {
        Boolean boolean1;
        Byte byte1;
        Short short1;
        Integer int1;
        Long long1;
        Float float1;
        Double double1;
        BytesWritable bytes1;
        Text string1;
        Map<Text, Text> map = new HashMap<Text, Text>();

        BigRow(Boolean b1, Byte b2, Short s1, Integer i1, Long l1, Float f1,
               Double d1,
               BytesWritable b3, String s2, Map<Text, Text> m2) {
            this.boolean1 = b1;
            this.byte1 = b2;
            this.short1 = s1;
            this.int1 = i1;
            this.long1 = l1;
            this.float1 = f1;
            this.double1 = d1;
            this.bytes1 = b3;
            if (s2 == null) {
                this.string1 = null;
            } else {
                this.string1 = new Text(s2);
            }
            this.map = m2;
        }
    }



    class BigRowWriter implements Runnable{

        @Override
        public void run() {
                try {
                    BigRow bigRow = bigRowQueue.take();
                    writer.addRow(bigRow);
                } catch (Exception e) {
                    e.printStackTrace();
                }

        }
    }

    public static void main(String[] args) throws IOException {
        OrcFileTester  orcFileTester = new OrcFileTester();
        orcFileTester.WriteBigRow();
    }



}

-----------------------------end of Java source ------------------------------

----------------------------- pom file start ----------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>ORCTester</groupId>
    <artifactId>ORCTester</artifactId>
    <version>1.0-SNAPSHOT</version>
  <dependencies>
      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>0.11.0</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.2</version>
      </dependency>
  </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.1.1</version>
                <executions>
                    <execution>
                        <phase>test</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>OrcFileTester</mainClass>
                            <arguments/>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
----------------------------- pom file end ----------------------------------

Re: Orc file format examples

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
Found TestOrcFile, will try that out.


On Mon, May 20, 2013 at 4:36 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <de...@gmail.com> wrote:

> I found the package, Looking for examples may be unit test cases.
>
>
> On Mon, May 20, 2013 at 4:32 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <de...@gmail.com>wrote:
>
>> Any suggestions on where can i find ORCFileFormat APIs ? I included
>> 0.11.0 v of Hive libraries, but am not finding it there.
>>
>>
>> On Mon, May 20, 2013 at 6:37 AM, Deepak Jain <de...@gmail.com> wrote:
>>
>>> Could anyone please point me to examples, on how to store data in Orc
>>> file format and then read it back row wise and column wise. Using mr job
>>> and pig scripts.
>>> Regards,
>>> Deepak
>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>


-- 
Deepak

Re: Orc file format examples

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
I found the package; now looking for examples, maybe unit test cases.


On Mon, May 20, 2013 at 4:32 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <de...@gmail.com> wrote:

> Any suggestions on where can i find ORCFileFormat APIs ? I included 0.11.0
> v of Hive libraries, but am not finding it there.
>
>
> On Mon, May 20, 2013 at 6:37 AM, Deepak Jain <de...@gmail.com> wrote:
>
>> Could anyone please point me to examples, on how to store data in Orc
>> file format and then read it back row wise and column wise. Using mr job
>> and pig scripts.
>> Regards,
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>


-- 
Deepak

Re: Orc file format examples

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
Any suggestions on where I can find the ORC file format APIs? I included the 0.11.0
version of the Hive libraries, but am not finding them there.


On Mon, May 20, 2013 at 6:37 AM, Deepak Jain <de...@gmail.com> wrote:

> Could anyone please point me to examples, on how to store data in Orc file
> format and then read it back row wise and column wise. Using mr job and pig
> scripts.
> Regards,
> Deepak
>
>


-- 
Deepak

Orc file format examples

Posted by Deepak Jain <de...@gmail.com>.
Could anyone please point me to examples of how to store data in ORC file format and then read it back row-wise and column-wise, using an MR job and Pig scripts?
Regards,
Deepak


Re: Filtering

Posted by Owen O'Malley <om...@apache.org>.
On Sun, May 19, 2013 at 3:11 PM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>    Hi Owen,
>
>
> Firstly I want to say a huge thank you. You have really helped me
> enormously.
>

You're welcome.

>
> OK. I think that I get it now. In my custom InputFormat I can read the
> config settings
>
>
> JobConf .get(“"hive.io.filter.text"”);
>
> JobConf .get(“"hive.io.filter.expr.serialized"”);
>

well, you don't need double quotes, but yes.


> And so I can then find the predicate that I need to do the filtering.
>
> In particular I can set the input splits so that it just reads the right
> records.
>

Right. You want the serialized one, because there is an API to convert it
back to a data structure.


> 1)      I didn’t know about HIVE-2925 and I would never have thought
> that suppressing the Map/Reduce would be controlled by something called
> “hive.fetch.task.conversion”.
>
> So maybe I’m missing a trick. How should I have found out about HIVE-2925?
>
There isn't a "trick" other than being willing to ask on the user lists and
use your favorite search engine. As Hive developers, we absolutely need to
make more things happen automatically and reduce the need to know specific
magic incantations. Or at least document the magic incantations. *smile*

> 2)      I would like to parse the filter.expr.serialized XML and I
> assume that there’s some
> SAX, DOM or even XSLT already in Hive. Could you give me a pointer to
> which classes
> are used (JAXP, Xerces, Xalan?) or where they are being used? Not
> important,
> I’m just being lazy.
>
If you look at pushFilters, it is using Utilities.serializeExpression, so
Utilities.deserializeExpression will reverse it.
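
So in your InputFormat you can pull the predicate back out along these lines (a sketch only;
the class name here is made up, and I believe the 0.10/0.11 deserializeExpression takes the
serialized string plus a Configuration, but double-check against the version you build with):

import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.mapred.JobConf;

// Sketch: recover the pushed-down predicate inside a custom InputFormat.
public class PushedFilterReader {
  public static ExprNodeDesc readFilter(JobConf job) {
    // Human-readable form, handy for logging.
    String filterText = job.get("hive.io.filter.text");
    if (filterText != null) {
      System.out.println("Pushed filter: " + filterText);
    }
    // What HiveInputFormat.pushFilters stored via Utilities.serializeExpression.
    String serialized = job.get("hive.io.filter.expr.serialized");
    if (serialized == null) {
      return null; // nothing was pushed down for this query
    }
    return Utilities.deserializeExpression(serialized, job);
  }
}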

> 3)      I really want to do my filtering in the getSplits of my
> custom InputFormat. However
> I have found that my getSplits is not being called. (And I asked about
> this on the list
> before.) I have found that if I do this
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> then my method is invoked. It seems to be something to do with avoiding
> the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
> However I don’t know whether there are any other bad things that will
> happen
> if I make this change as I don’t really know what I’m doing.
> Is this a safe thing to do?
>
Yes, that is a fine thing to do. It does mean that you'll need to ensure
you don't have too many maps, but other than that you should be ok. The
primary purpose of CombineHiveInputFormat is to allow Mappers to read from
multiple files.

> However I would like to say thanks again. If we ever meet in the real world
>
> I’ll stand you a beer (or equivalent).
>

Sounds good, although I'll take the equivalent, since I don't enjoy alcohol.


>
> Congratulations on version 0.11.0.
>

Thanks!

-- Owen

RE: Filtering

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Owen,

Firstly I want to say a huge thank you. You have really helped me enormously.
I realize that you have been busy with other things (like release 0.11.0) and
so I can understand that it must have been a pain to take time out to help me.

>The critical piece is in OpProcFactory where the setFilterExpression is called.
>
>OpProcFactory.pushFilterToStorageHandler
>  calls tableScanDesc.setFilterExpr
>    passes to TableScanDesc.getFilterExpr
>    which is called by HiveInputFormat.pushFilters
>
>HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into the configuration.
>
>Unless something is screwing it up, it looks like it hangs together.

OK. I think that I get it now. In my custom InputFormat I can read the config settings

JobConf .get(“"hive.io.filter.text"”);
JobConf .get(“"hive.io.filter.expr.serialized"”);

And so I can then find the predicate that I need to do the filtering.
In particular I can set the input splits so that it just reads the right records.
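
As a purely illustrative sketch of that plan (the class name ExampleFilteringInputFormat and the keep() test are hypothetical, and real pruning would depend entirely on how the underlying files are laid out):

    // Hypothetical InputFormat that drops splits using the pushed-down filter.
    // The configuration keys and Hive/Hadoop classes are real; the pruning
    // logic is a placeholder.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.ql.exec.Utilities;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class ExampleFilteringInputFormat extends TextInputFormat {

        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            InputSplit[] all = super.getSplits(job, numSplits);

            // "hive.io.filter.text" holds a readable form, handy for logging.
            String serialized = job.get("hive.io.filter.expr.serialized");
            if (serialized == null) {
                return all; // no predicate pushed down; keep every split
            }
            ExprNodeDesc predicate = Utilities.deserializeExpression(serialized, job);

            List<InputSplit> kept = new ArrayList<InputSplit>();
            for (InputSplit split : all) {
                // keep() is a placeholder: e.g. compare the predicate against
                // per-file statistics or an external index before keeping a split.
                if (keep(split, predicate)) {
                    kept.add(split);
                }
            }
            return kept.toArray(new InputSplit[kept.size()]);
        }

        private boolean keep(InputSplit split, ExprNodeDesc predicate) {
            return true; // placeholder: no real pruning implemented here
        }
    }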

>Really? With ORC, allowing the reader to skip over rows that don't matter is very important. Keeping Hive from rechecking the predicate is a nice to have.

Of course, you’re right. It doesn’t matter if the predicate is applied again to the records that are
already filtered. I meant that I couldn’t afford to leave the filter in place as it would mean that
a Map/Reduce would occur. But…

>There has been some work to add additional queries (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the predicate into the RecordReader isn't enough.
>I haven't looked through HIVE-2925 to see what is supported, but that is where I'd start.
>-- Owen

You’re right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work:
set hive.fetch.task.conversion=more;
I am really in good shape. Thanks.

There are a couple of quick questions that I would like to know the answers to, though.


1)      I didn’t know about HIVE-2925 and I would never have thought that suppressing the
Map/Reduce would be controlled by something called “hive.fetch.task.conversion”.
So maybe I’m missing a trick. How should I have found out about HIVE-2925?

2)      I would like to parse the filter.expr.serialized XML and I assume that there’s some
SAX, DOM or even XSLT already in Hive. Could you give me a pointer to which classes
are used (JAXP, Xerces, Xalan?) or where they are being used? Not important,
I’m just being lazy.

3)      I really want to do my filtering in the getSplits of my custom InputFormat. However
I have found that my getSplits is not being called. (And I asked about this on the list
before.) I have found that if I do this
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
then my method is invoked. It seems to be something to do with avoiding
the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
However I don’t know whether there are any other bad things that will happen
if I make this change as I don’t really know what I’m doing.
Is this a safe thing to do?


There are some other (less important) problems which I will ask about under separate cover.

However I would like to say thanks again. If we ever meet in the real world
I’ll stand you a beer (or equivalent).

Congratulations on version 0.11.0.

Z
aka
Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: Peter.Marron@TrilliumSoftware.com


RE: Filtering

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
>>On Wed, May 15, 2013 at 3:38 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
…
>I've started doing similar work for the ORC reader.

I guess that I’m glad that I’m not completely alone here.

>>
>>Firstly although that page mentions InputFormat there doesn’t seem to be any way (that I can find)
>>to perform filter passing to InputFormats and so I gave up on that approach.
>>
>There is. You just need to set  hive.optimize.index.filter to true. See https://issues.apache.org/jira/browse/HIVE-4242.

This is a little confusing. When I look through the code for the use of this configuration
I see that it’s effectively used in two places.
Firstly it’s used on line 55 of file PhysicalOptimizer.java to add an “IndexWhereResolver”.
Secondly it’s used on line 766 of file OpProcFactory.java to set a filter expression.

But I don’t see any point where the predicate is passed to the InputFormat class.
I guess that you’re saying that there’s some way that the InputFormat can retrieve the
predicate once it’s been stored. But it’s not clear to me how I do that.
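
For what it’s worth, a sketch of the retrieval side under that assumption (the class name FilterLookupExample is invented; the TableScanDesc constants are assumed to hold the two strings shown in the comments):

    // Sketch only: the two JobConf keys that HiveInputFormat.pushFilters
    // populates, read back inside an InputFormat or RecordReader.
    import org.apache.hadoop.hive.ql.plan.TableScanDesc;
    import org.apache.hadoop.mapred.JobConf;

    public class FilterLookupExample {
        public static void dumpPushedFilter(JobConf job) {
            // Assumed to be "hive.io.filter.text": the predicate as readable text.
            String text = job.get(TableScanDesc.FILTER_TEXT_CONF_STR);
            // Assumed to be "hive.io.filter.expr.serialized": the serialized tree.
            String expr = job.get(TableScanDesc.FILTER_EXPR_CONF_STR);
            System.out.println("filter text: " + text);
            System.out.println("serialized expression present: " + (expr != null));
        }
    }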

>>
>>That said, we really need to create a better interface that allows inputformats to negotiate what parts of the predicate they can process.

Ah, yes, sorry. I really want to be able to remove part of the predicate and subsume the filtering into the InputFormat class.
There’s little point in me going down this route if I can’t do that.

>>
>>-- Owen
>>

Thanks for prodding me into looking at the code, because now I see a big problem.

To recap, what I really want to do is to be able to effect filtering in the case where I do a
                select * from table;
query. This is the only query that I’m interested in because it seems to run without any
Map/Reduce overhead (either locally or in the cluster); it’s effectively just performing
some HDFS calls, and that’s what I desire.

What I really want to be able to do is to issue a query like this:
                select * from table where <predicate>
where I filter out the predicate and do the filtering in the InputFormat, so that Hive
effectively sees the query
                select * from table;
and runs it directly (no Map/Reduce), and I’m a happy bunny.

Now, as I say, I can’t see any way to effect this in the InputFormat directly.
If I use a storage handler then I am in “non-native table” territory and I
can’t LOAD my tables with data.

However I have just noticed that line 111 of file IndexWhereProcessor.java
seems to suggest that indexes are only ever used when the query is going
to run Map/Reduce. Is this so? So I seem to be in the position where I
can’t use InputFormat, StorageHandler or Indexes. What can I do?

Is there any way to filter the query without having to run Map/Reduce?

Any suggestions welcomed.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: Peter.Marron@TrilliumSoftware.com

Re: Filtering

Posted by Owen O'Malley <om...@apache.org>.
On Wed, May 15, 2013 at 3:38 AM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

> Hi,
>
> I’m using Hive 0.10.0 and Hadoop 1.0.4.
>
> I would like to create a normal table but have some of my code run so that
> I can remove filtering parts of the query and limit the output in the
> splits of the InputFormat. I believe that this is “Filtering Pushdown” as
> described in https://cwiki.apache.org/Hive/filterpushdowndev.html
>
> I have tried various approaches and run into problems and I was wondering
> if anyone had any suggestions as to how I might proceed.
>
I've started doing similar work for the ORC reader.

>
> Firstly although that page mentions InputFormat there doesn’t seem to be
> any way (that I can find) to perform filter passing to InputFormats and so
> I gave up on that approach.
>

There is. You just need to set  hive.optimize.index.filter to true. See
https://issues.apache.org/jira/browse/HIVE-4242.

That said, we really need to create a better interface that allows
inputformats to negotiate what parts of the predicate they can process.

-- Owen
