Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/02 17:57:54 UTC

[GitHub] [iceberg] rlcyf opened a new issue #2289: data is not updated in spark-shell

rlcyf opened a new issue #2289:
URL: https://github.com/apache/iceberg/issues/2289


   
   spark 3.0.1
   iceberg 0.11
   
   ```
   # produce one record to the Kafka topic
   bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
   > {"user_id":1}
   ```
   
   ```
   # consume the data with Structured Streaming; consumption succeeds
   val tableIdentifier: String = ...
   data.writeStream
       .format("iceberg")
       .outputMode("append")
       .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
       .option("path", tableIdentifier)
       .option("checkpointLocation", checkpointPath)
       .start()
   ```
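
   For reference, here is a fuller self-contained sketch of this kind of pipeline (a reconstruction, not the exact job): read the `test` topic from Kafka, parse the JSON value, and append to the Iceberg table. The schema, the checkpoint path, and the target table `prod.db.sample` (the table queried below) are assumptions added for illustration.
   ```
   import java.util.concurrent.TimeUnit
   import org.apache.spark.sql.functions.{col, from_json}
   import org.apache.spark.sql.streaming.Trigger
   import org.apache.spark.sql.types.{IntegerType, StructType}

   // schema of the JSON records produced to Kafka above
   val schema = new StructType().add("user_id", IntegerType)

   // read the Kafka topic and extract the JSON payload
   val data = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("subscribe", "test")
     .load()
     .select(from_json(col("value").cast("string"), schema).as("json"))
     .select("json.*")

   // append each micro-batch to the Iceberg table
   data.writeStream
     .format("iceberg")
     .outputMode("append")
     .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
     .option("path", "prod.db.sample")                        // assumed target table
     .option("checkpointLocation", "/tmp/checkpoints/sample") // placeholder path
     .start()
   ```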
   
   When I execute a query in spark-shell:
   ```
   bin/spark-shell --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.prod.type=hive --conf spark.sql.catalog.prod.warehouse=hdfs://localhost:9000/prod --conf spark.sql.warehouse.dir=hdfs://localhost:9000/prod
   
   spark.sql("select * from prod.db.sample").count
   res0: Long = 1
   
   # count on trino
   trino:db> select count(1) from prod.db.sample;
     1
    (1 rows)
   ```
   
   ```
   # produce one more record
   bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
   > {"user_id":1}
   ```
   ```
   spark.sql("select * from prod.db.sample").count
   res0: Long = 1
   
   # count on trino
   trino:db> select count(1) from prod.db.sample;
     2
    (1 rows)
   ```
   In Trino, the correct result can be queried in real time.
   When I close spark-shell and restart it:
   ```
   spark.sql("select * from prod.db.sample").count
   res0: Long = 2
   ```
   the result is correct.
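
   One way to confirm that the streaming job committed a new snapshot for each micro-batch is Iceberg's `snapshots` metadata table. This is just a verification sketch using the same `prod.db.sample` table; given the caching behaviour discussed in the comments below, run it in a fresh session or after a refresh.
   ```
   spark.sql("select committed_at, snapshot_id, operation from prod.db.sample.snapshots").show(truncate = false)
   ```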
   
   There is another situation: after inserting the data and waiting for some period of time (I don't know exactly how long), querying again does return the correct result.
   
   Is some kind of merge or compaction happening in the background?
   How can I configure spark-shell so that queries see the latest data in real time?
   
   




[GitHub] [iceberg] rlcyf closed issue #2289: data is not updated in spark-shell

Posted by GitBox <gi...@apache.org>.
rlcyf closed issue #2289:
URL: https://github.com/apache/iceberg/issues/2289


   




[GitHub] [iceberg] rlcyf commented on issue #2289: data is not updated in spark-shell

Posted by GitBox <gi...@apache.org>.
rlcyf commented on issue #2289:
URL: https://github.com/apache/iceberg/issues/2289#issuecomment-789466988


   > 1. The `CachingCatalog` is used by default for SQL queries; it can be turned off by adding the following parameter when launching spark-shell:
   > 
   > ```
   > --conf "spark.sql.catalog.hadoop_prod.cache-enabled=false"
   > ```
   > 
   > 2. Alternatively, refresh the table before querying it:
   > 
   > ```
   > spark.sql("refresh table prod.db.tb")
   > spark.sql("select * from prod.db.tb")
   > ```
   
   Thanks!
   




[GitHub] [iceberg] zhangdove commented on issue #2289: data is not updated in spark-shell

Posted by GitBox <gi...@apache.org>.
zhangdove commented on issue #2289:
URL: https://github.com/apache/iceberg/issues/2289#issuecomment-789408931


   1. The `CachingCatalog` is used by default for SQL queries; it can be turned off by adding the following parameter when launching spark-shell:
   ```
   --conf "spark.sql.catalog.hadoop_prod.cache-enabled=false"
   ```
   
   2. Alternatively, refresh the table before querying it:
   ```
   spark.sql("refresh table prod.db.tb")
   spark.sql("select * from prod.db.tb")
   ```
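
   Combining option 1 with the catalog settings from the original spark-shell command in this issue, the launch would look roughly like this; the only addition is the `cache-enabled=false` flag, and note that the catalog name in the property key must match the catalog being queried (here `prod`):
   ```
   bin/spark-shell \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.prod.type=hive \
     --conf spark.sql.catalog.prod.warehouse=hdfs://localhost:9000/prod \
     --conf spark.sql.warehouse.dir=hdfs://localhost:9000/prod \
     --conf "spark.sql.catalog.prod.cache-enabled=false"
   ```
   With option 2, keep in mind that the `REFRESH TABLE` statement has to be re-run before each query that should see newly committed snapshots.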

