Posted to dev@carbondata.apache.org by aaron <94...@qq.com> on 2018/09/26 15:47:41 UTC

[Serious Issue] Rows disappeared

Hi Community,

It seems that rows have disappeared; the same query returns different results:

carbon.time(carbon.sql(
  s"""
     |EXPLAIN SELECT date, market_code, device_code, country_code, category_id,
     |  product_id, est_free_app_download, est_paid_app_download, est_revenue
     |FROM store
     |WHERE date = '2016-09-01' AND device_code='ios-phone' AND
     |  country_code='EE' AND product_id IN (590416158, 590437560)"""
    .stripMargin).show(truncate = false)
)


Screen_Shot_2018-09-26_at_11.png
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t357/Screen_Shot_2018-09-26_at_11.png>



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
Yes, you're right.




Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
Great, I will have a try later.




Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
Cool! It works now.  Thanks a lot!




Re: [Serious Issue] Rows disappeared

Posted by Ajantha Bhat <aj...@gmail.com>.
@Aaron:

Please find the fix for this issue in the PR below:

https://github.com/apache/carbondata/pull/2784

I also added a test case, and it passes after my fix.

Thanks,
Ajantha


Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
@Ajantha, Great! looking forward to your fix:)




Re: [Serious Issue] Rows disappeared

Posted by Ajantha Bhat <aj...@gmail.com>.
@Aaron: I was able to reproduce the issue with my own dataset (350 KB in total).

The issue has nothing to do with the local dictionary. I have narrowed down the scenario: it occurs with sort columns + compaction.

I will fix it soon and update you.
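For anyone trying to follow along, a minimal reproduction of that narrowed-down scenario might look roughly like this (the table name, schema, and CSV path are hypothetical; DDL syntax as in CarbonData 1.5, assuming an existing CarbonSession `carbon`):

```scala
// Hypothetical sketch: sort columns + compaction, the narrowed-down scenario.
carbon.sql("DROP TABLE IF EXISTS repro")
carbon.sql(
  """CREATE TABLE repro (date STRING, product_id BIGINT, est_revenue DOUBLE)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('SORT_COLUMNS'='date,product_id')""".stripMargin)

// Load more than once so there are multiple segments to compact.
carbon.sql("LOAD DATA INPATH '/tmp/sample.csv' INTO TABLE repro")
carbon.sql("LOAD DATA INPATH '/tmp/sample.csv' INTO TABLE repro")

// Compare the row count before and after compaction.
carbon.sql("SELECT count(*) FROM repro").show()
carbon.sql("ALTER TABLE repro COMPACT 'MINOR'")
carbon.sql("SELECT count(*) FROM repro").show()
```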

Thanks,
Ajantha

On Thu, Sep 27, 2018 at 8:05 PM Kumar Vishal <ku...@gmail.com>
wrote:

> Hi Aaron,
> Can you please run compaction again with
> carbon.local.dictionary.decoder.fallback=false
> and share the result for the same?
>
> -Regards
> Kumar Vishal
>

Re: [Serious Issue] Rows disappeared

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Aaron,
Can you please run compaction again with
carbon.local.dictionary.decoder.fallback=false
and share the result for the same?
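One way to apply that suggestion, sketched here for reference (the property key is taken from this mail; the `addProperty` call and the `store` table are as used elsewhere in this thread):

```scala
// Set the fallback property before triggering compaction again.
CarbonProperties.getInstance()
  .addProperty("carbon.local.dictionary.decoder.fallback", "false")

// Re-run compaction on the affected table, then re-check the query result.
carbon.sql("ALTER TABLE store COMPACT 'MINOR'")
```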

-Regards
Kumar Vishal


Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
This is the method I use to construct the carbon session instance; I hope this can help you.

def carbonSession(appName: String, masterUrl: String, parallelism: String,
    logLevel: String,
    hdfsUrl: String = "hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000"): SparkSession = {
  val storeLocation = s"${hdfsUrl}/usr/carbon/data"

  CarbonProperties.getInstance()
    .addProperty(CarbonCommonConstants.STORE_LOCATION, storeLocation)
    .addProperty(CarbonCommonConstants.ENABLE_UNSAFE_SORT, "true")
    .addProperty(CarbonCommonConstants.ENABLE_OFFHEAP_SORT, "true")
    .addProperty(CarbonCommonConstants.CARBON_TASK_DISTRIBUTION,
      CarbonCommonConstants.CARBON_TASK_DISTRIBUTION_BLOCKLET)
    .addProperty(CarbonCommonConstants.CARBON_CUSTOM_BLOCK_DISTRIBUTION, "false")
    .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
    //.addProperty(CarbonCommonConstants.ENABLE_AUTO_HANDOFF, "true")
    .addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE, "true")
    .addProperty(CarbonCommonConstants.COMPACTION_SEGMENT_LEVEL_THRESHOLD, "4,3")
    .addProperty(CarbonCommonConstants.DAYS_ALLOWED_TO_COMPACT, "0")
    .addProperty(CarbonCommonConstants.CARBON_BADRECORDS_LOC,
      s"${hdfsUrl}/usr/carbon/badrecords")
    .addProperty(CarbonCommonConstants.CARBON_QUERY_MIN_MAX_ENABLED, "true")
    .addProperty(CarbonCommonConstants.ENABLE_QUERY_STATISTICS, "false")
    .addProperty(CarbonCommonConstants.ENABLE_DATA_LOADING_STATISTICS, "false")
    .addProperty(CarbonCommonConstants.MAX_QUERY_EXECUTION_TIME, "2")  // 2 minutes
    .addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK")
    .addProperty(CarbonCommonConstants.LOCK_PATH, s"${hdfsUrl}/usr/carbon/lock")
    .addProperty(CarbonCommonConstants.CARBON_MERGE_SORT_READER_THREAD, s"${parallelism}")
    .addProperty(CarbonCommonConstants.CARBON_INVISIBLE_SEGMENTS_PRESERVE_COUNT, "100")
    .addProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS, s"${parallelism}")
    .addProperty(CarbonCommonConstants.LOAD_SORT_SCOPE, "LOCAL_SORT")
    .addProperty(CarbonCommonConstants.NUM_CORES_COMPACTING, s"${parallelism}")
    .addProperty(CarbonCommonConstants.UNSAFE_WORKING_MEMORY_IN_MB, "4096")
    .addProperty(CarbonCommonConstants.NUM_CORES_LOADING, s"${parallelism}")
    .addProperty(CarbonCommonConstants.CARBON_MAJOR_COMPACTION_SIZE, "1024")
    .addProperty(CarbonCommonConstants.BLOCKLET_SIZE, "64")
    //.addProperty(CarbonCommonConstants.TABLE_BLOCKLET_SIZE, "64")

  import org.apache.spark.sql.CarbonSession._

  val carbon = SparkSession
    .builder()
    .master(masterUrl)
    .appName(appName)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.dfs.replication", 1)
    .config("spark.cores.max", s"${parallelism}")
    .getOrCreateCarbonSession(storeLocation)

  carbon.sparkContext.hadoopConfiguration.setInt("dfs.replication", 1)

  carbon.sql(s"SET spark.default.parallelism=${parallelism}")
  carbon.sql(s"SET spark.sql.shuffle.partitions=${parallelism}")
  carbon.sql(s"SET spark.sql.cbo.enabled=true")
  carbon.sql(s"SET carbon.options.bad.records.logger.enable=true")

  carbon.sparkContext.setLogLevel(logLevel)
  carbon
}




Re: [Serious Issue] Rows disappeared

Posted by Ajantha Bhat <aj...@gmail.com>.
So, are both the local dictionary and compaction required to reproduce the
issue? Without either one of them, the issue will not occur, right?

On Thu 27 Sep, 2018, 6:54 PM aaron, <94...@qq.com> wrote:

> Another comment, this issue can be reproduces on spark2.3.1 +
> carbondata1.5.0, spark2.2.2 + carbondata1.5.0, I can send you the jar I
> compiled to you, hope this could help you.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
Another comment: this issue can be reproduced on spark 2.3.1 + carbondata 1.5.0
and on spark 2.2.2 + carbondata 1.5.0. I can send you the jar I compiled;
I hope this could help you.




Re: [Serious Issue] Rows disappeared

Posted by aaron <94...@qq.com>.
**************************************************************************
a) First can you disable local dictionary and try the same scenario? -- I will
test that at another time.

Good idea, and I think this works: when I use the global dictionary, the query
returns the correct result. But the problem is that the global dictionary also
introduces a bug on Spark 2.3, which I described in another issue:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Issue-Dictionary-and-S3-td63106.html
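As a reference for that experiment, a hedged sketch of disabling the local dictionary per table (the LOCAL_DICTIONARY_ENABLE table property exists in CarbonData 1.5; the table name and columns here are illustrative, not the real schema):

```scala
// Illustrative only: recreate a table with the local dictionary disabled,
// so queries fall back to the non-local-dictionary code path.
carbon.sql(
  """CREATE TABLE store_no_local_dict (date STRING, product_id BIGINT)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false')""".stripMargin)
```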

**************************************************************************
b) Can you drop the datamap and try the same scenario? -- If data is coming
from the datamap (you can see this in the explain command)

I have confirmed this: the datamap is not the reason, because the issue
reproduces without the datamap.

**************************************************************************
c) Avoid compaction and try the same scenario.

I've confirmed that without compaction the query works well.
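A sketch of one way to avoid compaction as a temporary workaround, using the same `ENABLE_AUTO_LOAD_MERGE` constant that appears in the carbonSession setup elsewhere in this thread (treat this as an assumption-laden workaround, not a fix):

```scala
// Workaround sketch: turn off automatic compaction so loaded segments
// are never merged, sidestepping the compaction code path.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE, "false")
```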

**************************************************************************
d) If you can share, give me the test data and complete steps. (Because
compaction and the other steps are not in your previous mail.)

The data is kind of huge; the table holds about 7 TB of raw CSV data, so I
have no good way to give you test data. :)




Re: [Serious Issue] Rows disappeared

Posted by Ajantha Bhat <aj...@gmail.com>.
Hi Aaron,
Thanks for reporting the issue.
Can you help me narrow it down? I cannot reproduce it locally with the
information given in your mail.

a) First can you disable local dictionary and try the same scenario?
b) Can you drop the datamap and try the same scenario? -- If data is coming
from the datamap (you can see this in the explain command)
c) Avoid compaction and try the same scenario.
d) If you can share, give me the test data and complete steps. (Because
compaction and other steps are not there in your previous mail.)
Meanwhile, I will try to reproduce it locally again, but I don't have the
complete steps you executed.

Thanks,
Ajantha
