Posted to user@spark.apache.org by Lior Chaga <li...@taboola.com> on 2015/07/15 09:09:44 UTC

spark sql - group by constant column

Hi,

I'm facing a bug with GROUP BY in Spark SQL (version 1.4).
I registered a JavaRDD of objects containing integer fields as a table.

Then I try to run a GROUP BY with a constant value among the grouping
fields:

SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, num


I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
columns measure, primary_one, primary_two
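
For comparison, a sketch (not verified) of the same aggregation written
so that the GROUP BY does not reference the alias, either by repeating
the literal or by dropping it from the grouping since it is constant for
every row:

-- variant 1: repeat the literal in the GROUP BY
SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, 10

-- variant 2: leave the constant out of the GROUP BY
SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two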

I tried the original query with both HiveContext and SQLContext.
The odd thing is that this kind of query actually works for me in a
project I'm working on, though there I have another bug (the group by
does not yield the expected results).

The only explanation I can think of is that in my real project the
context configuration is different.
In the example above, the HiveContext configuration is empty.

In my real project, the configuration is shown below.
Any ideas?

Thanks,
Lior

Hive context configuration in project:
"(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
"(nfs3.mountd.port,4242)"
"(mapreduce.tasktracker.healthchecker.script.timeout,600000)"
"(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
"(mapreduce.input.fileinputformat.input.dir.recursive,false)"
"(hive.orc.compute.splits.num.threads,10)"
"(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
"(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
"(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
"(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
 "(hive.mapjoin.lazy.hashtable,true)"
 "(mapreduce.framework.name,local)"
 "(hive.exec.script.maxerrsize,100000)"
 "(dfs.namenode.checkpoint.txns,1000000)"
 "(tfile.fs.output.buffer.size,262144)"
 "(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
 "(mapreduce.tasktracker.local.dir.minspacekill,0)"
 "(hive.support.concurrency,false)"
 "(fs.s3.block.size,67108864)"
 "(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
 "(hive.stats.retries.max,0)"
 "(hadoop.hdfs.configuration.version,1)"
 "(dfs.bytes-per-checksum,512)"
 "(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
 "(mapreduce.job.acl-view-job, )"
 "(hive.typecheck.on.insert,true)"
 "(mapreduce.jobhistory.loadedjobs.cache.size,5)"
 "(mapreduce.jobtracker.persist.jobstatus.hours,1)"
 "(hive.unlock.numretries,10)"
 "(dfs.namenode.handler.count,10)"
 "(mapreduce.input.fileinputformat.split.minsize,1)"
 "(hive.plan.serialization.format,kryo)"
 "(dfs.datanode.failed.volumes.tolerated,0)"
 "(yarn.resourcemanager.container.liveness-monitor.interval-ms,600000)"
 "(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
 "(yarn.resourcemanager.client.thread-count,50)"
 "(io.seqfile.compress.blocksize,1000000)"
 "(mapreduce.tasktracker.http.threads,40)"
 "(hive.explain.dependency.append.tasktype,false)"
 "(dfs.namenode.retrycache.expirytime.millis,600000)"
 "(dfs.namenode.backup.address,0.0.0.0:50100)"
 "(hive.hwi.listen.host,0.0.0.0)"
 "(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
 "(dfs.replication,3)"
 "(mapreduce.jobtracker.jobhistory.block.size,3145728)"
 "(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
 "(mapreduce.task.profile.maps,0-2)"
 "(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
 "(hive.stats.reliable,false)"
 "(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"

Re: spark sql - group by constant column

Posted by Lior Chaga <li...@taboola.com>.
I found the problem. Grouping by the alias of a constant column is
indeed not possible.
The reason it appeared to "work" in my project is that I gave the
constant column an alias that already exists in the schema of the
DataFrame. The DataFrame contained a "data_timestamp" column representing
an hour, and I added to the SELECT a constant aliased "data_timestamp"
that represented the timestamp of the day. That was also the cause of my
original bug: I thought I was grouping by the day timestamp, when I was
actually grouping by each hour, and therefore got multiple rows for each
group by combination.
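
To illustrate (the column names and values below are made up, not the
real schema), the collision looked roughly like this:

-- tbl already has an hourly data_timestamp column.
-- Aliasing the day constant with the same name makes the GROUP BY
-- resolve to the existing hourly column, so the result is per hour:
SELECT primary_one, 1436911200 as data_timestamp, SUM(measure) as total
FROM tbl
GROUP BY primary_one, data_timestamp

-- Using an alias that does not exist in the schema (and leaving it out
-- of the GROUP BY) gives one row per primary_one, as intended:
SELECT primary_one, 1436911200 as day_timestamp, SUM(measure) as total
FROM tbl
GROUP BY primary_one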
