You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@ignite.apache.org by soroka21 <so...@gmail.com> on 2017/11/27 19:40:01 UTC

SQL Count(*) returns incorrect count

Hi,

I've loaded 50 million BinaryObjects into TEST cache using Apache Spark 
They look like this:

o.a.i.i.binary.BinaryObjectImpl | DATASET1 [hash=86282065, F01=-206809353,
F00=A1782096681-B2022047863-C554782990, F03=Must be timestamp,
F02=2.6983596317719E8, F05=182918247,
F04=A1997114384-B293944165-C1698394711, F07=Some string,
F06=-9.9171693326279E8, F09=47148799, F08=A1582165880-B1796903902-C784688220

BinaryObject name is DATASET1
I'm trying to get count of those objects using SQL:

            String sqlStr = "SELECT COUNT(*) FROM TEST.DATASET1";
            Ignite ignite = Ignition.start();
            IgniteCache<String, BinaryObject> cache =
ignite.cache(cacheName);
            logger.info("Starting SQL query : " + sqlStr);
            logger.info("ALL Cache size is " +
cache.sizeLong(CachePeekMode.ALL));
            logger.info("PRIMARY Cache size is " +
cache.sizeLong(CachePeekMode.PRIMARY));
            logger.info("NEAR Cache size is " +
cache.sizeLong(CachePeekMode.NEAR));
            logger.info("OFFHEAP Cache size is " +
cache.sizeLong(CachePeekMode.OFFHEAP));
            logger.info("ONHEAP Cache size is " +
cache.sizeLong(CachePeekMode.ONHEAP));

            SqlFieldsQuery sql = new SqlFieldsQuery(sqlStr).setLocal(false);

            FieldsQueryCursor<List&lt;?>> result = cache.query(sql);
            printResults(result);

17/11/27 19:34:18 INFO IgnLoader: Starting SQL query : SELECT COUNT(*) FROM
TEST.DATASET1
17/11/27 19:34:23 INFO IgnLoader: ALL Cache size is 50000000
17/11/27 19:34:23 INFO IgnLoader: PRIMARY Cache size is 50000000
17/11/27 19:34:23 INFO IgnLoader: NEAR Cache size is 0
17/11/27 19:34:23 INFO IgnLoader: OFFHEAP Cache size is 50000000
17/11/27 19:34:23 INFO IgnLoader: ONHEAP Cache size is 5120
17/11/27 19:34:45 INFO IgnLoader: Result size : 1

row 0:[COUNT(*)(Long) :*9859833*; ]

17/11/27 19:34:45 INFO IgnLoader: Done

My function printResults reports only 9,859,833 records even if I use
setLocal(false)

What should I do to run Query against full data set?

Thank you for help.
Alexey






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by afedotov <al...@gmail.com>.

Hi,

For an example of table creation via native API please check
https://apacheignite-sql.readme.io/docs/schema-and-indexes#section-overview.

Kind regards,
Alex



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by soroka21 <so...@gmail.com>.

Yes, that worked for me. I guess this will be fixed in version 2.4. 
Final version of JavaIgniteContext construction looks like:

JavaIgniteContext<String, BinaryObject> igniteContext
  = new JavaIgniteContext<>(jctx, () -> {
       IgniteConfiguration cfg = null;
                try {
                    cfg =
IgnitionEx.loadConfiguration(igniteConfigPath).get1();
                    *cfg.setClientMode(true);*
                } catch(Exception e){
                    // Report exception and return null 
                    logger.error(e.getMessage());
                    e.printStackTrace();                  
             }             
             return cfg;
      }, true);

Thanks a lot for helping with this!
Was able to load 50,000,000 records from parquet files (6.4GB) in less than
4 minutes.

P.S. What would be the correct way to execute DDL (CREATE TABLE) statement
without JDBC/ODBC?





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by vkulichenko <va...@gmail.com>.

Hi,

Your assumptions are correct. There is also an issue [1] that likely is
causing this behavior. As a workaround you can try to force IgniteContext to
start everything in client mode. In order to achieve this, use
setClientMode(true) in the closure that created IgniteConfiguration:

IgniteOutClosure ioc = new IgniteOutClosure() {
    @Override
    public Object apply() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        cfg.setCacheConfiguration(cacheCfg);
        return cfg;
    }
};

[1] https://issues.apache.org/jira/browse/IGNITE-5981

-Val



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by soroka21 <so...@gmail.com>.

Thank you for suggestions.
I have 5 node standalone ignite cluster and the main goal is to load data
into it and store it for long time for future use. I can't keep Spark
workers in memory and assume what my data is ending up in cache distributed
inside 5 standalone ignite nodes.

Spark process is starting 5 additional ignite servers ( I can see it from
topology snapshot) 
With standalone=true or standalone=false same issue is happening when I'm
running my spark application using YARN.

However I found what if Spark is working in local, non-distributed mode
(e.g. spark.master=local[2]) records are not getting lost.

So it looks like the issue is when spark workers, processing JavaIgniteRDD,
are going down.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by vkulichenko <va...@gmail.com>.

Looks like you're running in embedded mode. In this mode server nodes are
started within Spark executors, so when executor is stopped some of the data
is lost. Try to start a separate Ignite cluster and create IgniteContext
with standalone=true.

-Val



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by soroka21 <so...@gmail.com>.

Switched to Ignite 2.3.0 in hope it has better behavior.
Unfortunately it is not.

During the execution of Spark job number of cache rows is growing but after
Spark job completes - looks like some entries has been removed.
JavaIgniteRDD shows correct count but again final result is incorrect.

I was using sqlline to track count of record during Spark job execution - it
shows some problem also:

0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 6428                           |
+--------------------------------+
1 row selected (0.063 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 62501                          |
+--------------------------------+
1 row selected (0.096 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 268183                         |
+--------------------------------+
1 row selected (0.154 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 482616                         |
+--------------------------------+
1 row selected (0.159 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 722436                         |
+--------------------------------+
1 row selected (0.159 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 1017236                        |             <---------------------------
Still looks good here
+--------------------------------+
1 row selected (0.205 seconds)
0: jdbc:ignite:thin://10.238.42.86/> select count(*) from data3;
+--------------------------------+
|            COUNT(*)            |
+--------------------------------+
| 542535                         |        <---------------------------
Ooops!
+--------------------------------+






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by soroka21 <so...@gmail.com>.

Hi Val

Thank you for response

you can find maven project here:

https://github.com/Soroka21/ign-loader-spark <http://>  

This app actually loads any parquet file into cache (let me know if you need
one)

I've tried to run on 1.2 mil records wit the same symptoms - looks like my
app is working with portion of cache instead of whole cache. May be
something is wrong with the way I'm putting data into cache.

You can see converter from Spark Row to Ignite BinaryObject in
RowToIgniteBinaryObjectConverter.java 
Which in term is consumed in my SparkLoader.load()

Sorry for not very clean code - it was done as a part of early R&D POC in my
project.

My ignite environment is based on Ignite 2.2.0 version running on 5 nodes.
1GB memory each instance.
Config is almost unchanged - you can find it in conf directory.







--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: SQL Count(*) returns incorrect count

Posted by vkulichenko <va...@gmail.com>.

Alexey,

Something is wrong, but I don't see any obvious mistakes in your code. Is it
possible to provide a test as a standalone GitHub project so that I can run
it and reproduce the problem?

Is it reproduced on smaller data sets? Or if load not through Spark, but
just doing regular put/putAll operations or via IgniteDataStreamer?

-Val



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/