You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Sergi Vladykin (JIRA)" <ji...@apache.org> on 2015/12/14 08:29:46 UTC
[jira] [Comment Edited] (IGNITE-2143) inner join produce wrong
result and is very slow
[ https://issues.apache.org/jira/browse/IGNITE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055574#comment-15055574 ]
Sergi Vladykin edited comment on IGNITE-2143 at 12/14/15 7:29 AM:
------------------------------------------------------------------
Guys,
The result is wrong because GROUP BY is only supported at the top level query, we can't use it in sub-queries now (we will improve that in future releases).
I'd suggest to use a flat query like
{code}
select
ForkTable.lang, count(distinct ForkTable.repo_url) as cnt
from
gitrecord ForkTable, gitrecord PullTable
where
ForkTable.repo_url = PullTable.repo_url
and ForkTable.type = 'ForkEvent'
and PullTable.type='PullRequestEvent'
and ForkTable.lang is not null
group by lang
order by cnt desc
{code}
and create index on (type, repo_url)
Also you have to make sure that GitRecord.repo_url is an affinity key.
was (Author: sergi.vladykin):
Guys,
The result is wrong because GROUP BY is only supported at the top level query, we can't use it in sub-queries now (we will improve that in future releases).
I'd suggest to use a flat query like
```
select
ForkTable.lang, count(distinct ForkTable.repo_url) as cnt
from
gitrecord ForkTable, gitrecord PullTable
where
ForkTable.repo_url = PullTable.repo_url
and ForkTable.type = 'ForkEvent'
and PullTable.type='PullRequestEvent'
and ForkTable.lang is not null
group by lang
order by cnt desc
```
and create index on (type, repo_url)
Also you have to make sure that GitRecord.repo_url is an affinity key.
> inner join produce wrong result and is very slow
> ------------------------------------------------
>
> Key: IGNITE-2143
> URL: https://issues.apache.org/jira/browse/IGNITE-2143
> Project: Ignite
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5
> Environment: 3 nodes in docker containers.
> Reporter: Sergey Soldatov
> Priority: Critical
> Attachments: config.xml, ignite-cache-1.0.0.jar
>
>
> I have following query for gitarchive records processing.
> {code}
> select
> lang, count(distinct ForkTable.repo_url) as cnt
> from (
> select repo_url, lang from gitrecord where type = 'ForkEvent' and
> lang is not null group by lang, repo_url) as ForkTable
> inner join (
> select repo_url, html_url from gitrecord where type='PullRequestEvent'
> group by repo_url, html_url) as PullTable
> on
> ForkTable.repo_url = PullTable.repo_url
> group by lang
> order by cnt desc
> {code}
> And there are two major problems:
> 1. It produces an incorrect result if it's running in cluster mode.
> 2. It's really slow. for 200k rows it takes 500+ seconds comparing to 20 on Spark for the same query
> Steps to reproduce:
> 1. Download github archive for 1 day and put it to hdfs ("/Data" in my case):
> {code}wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz{code}
> 2. copy attached ignite-cache-1.0.0.jar to Ignite's lib dir
> 3. run ignite with attached config.xml
> 4. run spark-shell
> {code}
> spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 --repositories http://www.gridgainsystems.com/nexus/content/repositories/external --jars /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar --master spark://{spark-master}:7077
> {code}
> 5. load data and execute the query:
> {code}
> import org.apache.ignite._
> import org.apache.ignite.configuration._
> import org.apache.ignite.cache.query._
> import org.apache.ignite.spark._
> import org.apache.ignite.lang.IgniteBiPredicate
> import io.dtk._
> val df = sqlContext.read.json("/Data/*.gz")
> Ignition.setClientMode(true)
> val cacheName = "gitrecords"
> val rdd = df.select("id", "repo.url", "payload.forkee.language","type","payload.pull_request.head.repo.html_url").map(row => (row.getString(0), new GitRecord(0, row.getString(1),row.getString(2),row.getString(3),row.getString(4))))
> val cfg = new CacheConfiguration[String,GitRecord]().setName(cacheName).setIndexedTypes(classOf[String],classOf[GitRecord]).setLoadPreviousValue(true)
> val ic = new IgniteContext[String,GitRecord](sc, "/usr/lib/ignite-hadoop/config/config.xml")
> val sharedRDD = ic.fromCache(cfg)
> sharedRDD.savePairs(rdd)
> sharedRDD.sql("select lang, count(distinct ForkTable.repo_url) as cnt from (select repo_url, lang from gitrecord where type = 'ForkEvent' and lang is not null group by lang, repo_url) as ForkTable inner join (select repo_url, html_url from gitrecord where type='PullRequestEvent' group by repo_url, html_url) as PullTable on ForkTable.repo_url = PullTable.repo_url group by lang order by cnt desc").show
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)