Posted by Peter Marron on 2014/01/27 15:29:42 UTC

Indexes, again


I am using Hadoop 1.0.4 and Hive 0.11.0.

I am trying to create my own indexes. Given the problems that I have had in the past, I thought
it best to proceed slowly. So I created my own class, derived from TableBasedIndexHandler.
I copied all the methods from CompactIndexHandler but added lots of System.out.println calls so that I
could see what was going on. So this is, effectively, an instrumented copy of CompactIndexHandler.

When I try to create an index using compact, most things seem to work:

> DROP INDEX champions_attendance ON champions;
Time taken: 0.139 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions(attendance) AS 'compact' WITH DEFERRED REBUILD;
Time taken: 0.173 seconds
hive> SHOW INDEX ON champions;
champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.073 seconds, Fetched: 1 row(s)
hive> SHOW FORMATTED INDEX ON champions;
idx_name                tab_name                col_names               idx_tab_name            idx_type                comment

champions_attendance    champions               attendance              default__champions_champions_attendance__       compact
Time taken: 0.067 seconds, Fetched: 4 row(s)

However, when I try the same thing with my class, things start out promisingly:

Time taken: 0.149 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions (attendance) AS 'com.trilliumsoftware.profiling.index.ProfilerIndex' WITH DEFERRED REBUILD;
My usesIndexTable - returning true!
My analyzeIndexDefinitionYYY
table ->Table(tableName:champions, dbName:default, owner:pmarron, createTime:1390214100, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:string, comment:null), FieldSchema(name:home, type:string, comment:null), FieldSchema(name:away, type:string, comment:null), FieldSchema(name:score, type:string, comment:null), FieldSchema(name:venue, type:string, comment:null), FieldSchema(name:attendance, type:string, comment:null)], location:hdfs://hpcluster1/user/pmarron/Ex/data, inputFormat:org.apache.hadoop.mapred.TextInputFormat,, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1390214100}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)<-
index ->Index(indexName:champions_attendance, indexHandlerClass:com.trilliumsoftware.profiling.index.ProfilerIndex, dbName:default, origTableName:champions, createTime:1390832429, lastAccessTime:1390832429, indexTableName:default__champions_champions_attendance__, sd:StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat,, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{}, deferredRebuild:true)<-
My usesIndexTable - returning true!
usesIndexTable ->true<-
indexTable ->Table(tableName:default__champions_champions_attendance__, dbName:default, owner:null, createTime:0, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:INDEX_TABLE)<-
storageDesc ->StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat,, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false)<-
My usesIndexTable - returning true!
Going into the branch
My analyzeIndexDefinition OUT
My usesIndexTable - returning true!
Time taken: 0.263 seconds

But then things seem to go wrong:

Time taken: 0.149 seconds
    > SHOW INDEX ON champions;
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I have instrumented all of the method calls, so the fact that I don't see any tracing suggests that none
of my code is on the path that makes this fail. So I am at a loss to know where to start.
Is there some other sort of registration of my index handler class that I have to make somewhere?

If I ignore this error and carry on, then the command

                ALTER INDEX champions_attendance ON champions REBUILD;

seems to succeed _and_ build an index. However, when I issue a query on my indexed table:

    > SELECT * FROM champions WHERE attendance=50000;
2000    Real Madrid     Valencia        3-0     Paris   50000
1980    Nottingham Forest       Hamburg 1-0     Madrid  50000
1975    Bayern Munich   Leeds Utd       2-0     Paris   50000
1970    Feyenoord       Celtic  2-1 (aet)       Milan   50000
1969    AC Milan        Ajax    04-Jan  Madrid  50000
Time taken: 0.158 seconds, Fetched: 5 row(s)

it doesn't seem to call my index handler's generateIndexQuery method,
which is what I was hoping to achieve. Maybe this is for the same
reason that SHOW INDEX fails?
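
For reference, Hive does not consult indexes at query time unless index filtering is switched on for the session, and a compact index is only used automatically when the input exceeds a minimum size. A minimal sketch of the settings that would normally need to be in place before generateIndexQuery could be reached, assuming the stock property names apply to Hive 0.11.0 (whether a custom handler class is then picked up by the optimizer is not confirmed here):

```sql
-- Assumption: these stock Hive properties are still at their defaults in this session.
-- Allow the optimizer to rewrite filters to use an index (off by default).
SET hive.optimize.index.filter=true;
-- Drop the minimum input size for compact index use so that a small table qualifies.
SET hive.optimize.index.filter.compact.minsize=0;

-- Inspect the plan to see whether the index table appears in it.
EXPLAIN SELECT * FROM champions WHERE attendance=50000;
```

If the index table does not show up in the plan even with these settings, the handler is probably not being considered at all.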

I guess that I could build Hive and try to debug it, but I haven't built Hive
before, and I'm worried that this will mean that I have to move to the
latest version and then to Hadoop 2, and that I will then
spend some time upgrading my cluster.

Is there anyone who can throw any light on my problems, or suggest
any way forward?

All feedback welcome.


Peter Marron

Office: +44 (0) 118-940-7609
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK