You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ignite.apache.org by "Sergey Soldatov (JIRA)" <ji...@apache.org> on 2015/12/13 02:07:46 UTC

[jira] [Created] (IGNITE-2143) inner join produce wrong result and is very slow

Sergey Soldatov created IGNITE-2143:
---------------------------------------

             Summary: inner join produce wrong result and is very slow
                 Key: IGNITE-2143
                 URL: https://issues.apache.org/jira/browse/IGNITE-2143
             Project: Ignite
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5
         Environment: 3 nodes in docker containers.
            Reporter: Sergey Soldatov
            Priority: Critical


I have following query for gitarchive records processing.  
{code}
select 
     lang, count(distinct ForkTable.repo_url) as cnt 
from (
     select repo_url, lang from gitrecord where type = 'ForkEvent' and 
     lang is not null group by lang, repo_url) as ForkTable 
inner join (
     select repo_url, html_url from gitrecord where type='PullRequestEvent'    
     group by repo_url, html_url) as PullTable 
on 
     ForkTable.repo_url = PullTable.repo_url 
group by lang 
order by cnt desc
{code}

And there are two major problems:
1. It produces an incorrect result if it's running in cluster mode.
2. It's really slow. for 200k rows it takes 500+ seconds comparing to 20 on Spark for the same query

Steps to reproduce:
1. Download github archive for 1 day and put it to hdfs ("/Data" in my case):
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
2. copy attached ignite-cache-1.0.0.jar to Ignite's lib dir 
3. run ignite with attached config.xml 
4. run spark-shell
{code}
spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 --repositories http://www.gridgainsystems.com/nexus/content/repositories/external --jars /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar --master spark://{spark-master}:7077
{code}
5. load data and execute the query:
{code}
import org.apache.ignite._
import org.apache.ignite.configuration._
import org.apache.ignite.cache.query._
import org.apache.ignite.spark._
import org.apache.ignite.lang.IgniteBiPredicate
import io.dtk._
val df = sqlContext.read.json("/Data/*.gz")
Ignition.setClientMode(true)
val cacheName = "gitrecords"

val rdd = df.select("id", "repo.url", "payload.forkee.language","type","payload.pull_request.head.repo.html_url").map(row => (row.getString(0), new GitRecord(0, row.getString(1),row.getString(2),row.getString(3),row.getString(4))))

val cfg = new CacheConfiguration[String,GitRecord]().setName(cacheName).setIndexedTypes(classOf[String],classOf[GitRecord]).setLoadPreviousValue(true)
val ic = new IgniteContext[String,GitRecord](sc, "/usr/lib/ignite-hadoop/config/config.xml")
val sharedRDD = ic.fromCache(cfg)
sharedRDD.savePairs(rdd)
sharedRDD.sql("select lang, count(distinct ForkTable.repo_url) as cnt from (select repo_url, lang from gitrecord where type = 'ForkEvent' and lang is not null group by lang, repo_url) as ForkTable inner join (select repo_url, html_url from gitrecord where type='PullRequestEvent' group by repo_url, html_url) as PullTable on ForkTable.repo_url = PullTable.repo_url group by lang order by cnt desc").show
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)