You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "markj-db (via GitHub)" <gi...@apache.org> on 2024/03/14 22:00:35 UTC

[PR] [SPARK-47404] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

markj-db opened a new pull request, #45526:
URL: https://github.com/apache/spark/pull/45526

### What changes were proposed in this pull request?
Add hooks to release the ANTLR DFA cache after parsing SQL

### Why are the changes needed?
ANTLR builds a DFA cache while parsing to speed up parsing of similar future inputs. However, this cache is never cleared and can only grow. Extremely large SQL inputs can lead to very large DFA caches (>20GiB in one extreme case I've seen).

Spark’s ANTLR SQL parser is derived from the Presto ANTLR SQL Parser, and Presto has added hooks to be able to clear this DFA cache. I think Spark should have similar hooks.

### Does this PR introduce _any_ user-facing change?
New `SQLConf` to control the behaviour (`spark.sql.releaseAntlrCacheAfterParse`)

### How was this patch tested?
New unit test

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525622438


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/parsers.scala:
##########
@@ -105,11 +109,61 @@ abstract class AbstractParser extends DataTypeParserInterface with Logging {
           messageParameters = e.getMessageParameters.asScala.toMap,
           queryContext = e.getQueryContext)
     }
+    finally {
+      if (conf.releaseAntlrCacheAfterParsing) {
+        AbstractSqlParser.refreshParserCaches()
+      }
+    }
   }
 
   private def conf: SqlApiConf = SqlApiConf.get
 }
 
+object AbstractSqlParser {
+  private case class AntlrCaches(atn: ATN) {
+    val predictionContextCache: PredictionContextCache = new PredictionContextCache
+    val decisionToDFA: Array[DFA] = AntlrCaches.makeDecisionToDFA(atn)
+    def installCaches(parser: SqlBaseParser): Unit = {
+      parser.setInterpreter(
+        new ParserATNSimulator(parser, atn, decisionToDFA, predictionContextCache))
+    }
+  }
+
+  private object AntlrCaches {
+    private def makeDecisionToDFA(atn: ATN): Array[DFA] = {
+      val decisionToDFA = new Array[DFA](atn.getNumberOfDecisions)
+      for (i <- 0 until atn.getNumberOfDecisions) {
+        decisionToDFA(i) = new DFA(atn.getDecisionState(i), i)
+      }
+      decisionToDFA
+    }
+  }
+
+  private val parserCaches = new AtomicReference[AntlrCaches](AntlrCaches(SqlBaseParser._ATN))
+
+  /**
+   * Install the parser caches into the given parser.
+   *
+   * This method should be called before parsing any input.
+   */
+  def installCaches(parser: SqlBaseParser): Unit = parserCaches.get().installCaches(parser)
+
+  /**
+   * Drop the existing parser caches and create a new one.
+   *
+   * ANTLR retains caches in its parser that are never released.  This speeds up parsing of future
+   * input, but it can consume a lot of memory depending on the input seen so far.

Review Comment:
   Does this cache work within a query or only for future queries?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525575792


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")

Review Comment:
   Done



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525575923


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {

Review Comment:
   Done.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525577050


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +
+        "the available heap on the driver, leading to high GC overhead and evenatually OOMs. Set " +
+        "this to `true` to release the cache after every SQL statement.  On a long-lived driver, " +

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525543352


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +
+        "the available heap on the driver, leading to high GC overhead and evenatually OOMs. Set " +
+        "this to `true` to release the cache after every SQL statement.  On a long-lived driver, " +
+        "it may be beneficial to periodically set this to `true`, run a query to clear the " +
+        "cache, and revert it to `false`.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   Since this is `false`, we preserve the existing behavior. Did I understand correctly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1526541475


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/parsers.scala:
##########
@@ -105,11 +109,61 @@ abstract class AbstractParser extends DataTypeParserInterface with Logging {
           messageParameters = e.getMessageParameters.asScala.toMap,
           queryContext = e.getQueryContext)
     }
+    finally {
+      if (conf.releaseAntlrCacheAfterParsing) {
+        AbstractSqlParser.refreshParserCaches()
+      }
+    }
   }
 
   private def conf: SqlApiConf = SqlApiConf.get
 }
 
+object AbstractSqlParser {

Review Comment:
   I was able to run the tests with:
   
   ```
   SKIP_UNIDOC=true JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 dev/run-tests --parallelism 1 --modules sparkr
   ```
   
   However, I'm seeing a different error:
   
   ```
   ── Error ('test_basic.R:25:3'): create DataFrame from list or data.frame ───────
   Error in `handleErrors(returnStatus, conn)`: java.lang.NoSuchMethodError: 'void org.eclipse.jetty.servlet.ServletHolder.<init>(jakarta.servlet.Servlet)'
           at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:118)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525543740


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {

Review Comment:
   nit. `private`



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {

Review Comment:
   ditto. `private`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525571423


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +
+        "the available heap on the driver, leading to high GC overhead and evenatually OOMs. Set " +
+        "this to `true` to release the cache after every SQL statement.  On a long-lived driver, " +
+        "it may be beneficial to periodically set this to `true`, run a query to clear the " +
+        "cache, and revert it to `false`.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   Yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on PR #45526:
URL: https://github.com/apache/spark/pull/45526#issuecomment-1998729476

   > Could you provide a reference link to the corresponding code in the PR description?
   
   Done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525547164


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB

Review Comment:
   I'm worrying about the possibility where this test case might introduce a new flakiness in our GitHub Action environment. Especially, Apple Silicon environment. Do you think this test case clears up the leftovers after completing this test suite?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525685568


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/parsers.scala:
##########
@@ -105,11 +109,61 @@ abstract class AbstractParser extends DataTypeParserInterface with Logging {
           messageParameters = e.getMessageParameters.asScala.toMap,
           queryContext = e.getQueryContext)
     }
+    finally {
+      if (conf.releaseAntlrCacheAfterParsing) {
+        AbstractSqlParser.refreshParserCaches()
+      }
+    }
   }
 
   private def conf: SqlApiConf = SqlApiConf.get
 }
 
+object AbstractSqlParser {
+  private case class AntlrCaches(atn: ATN) {
+    val predictionContextCache: PredictionContextCache = new PredictionContextCache
+    val decisionToDFA: Array[DFA] = AntlrCaches.makeDecisionToDFA(atn)
+    def installCaches(parser: SqlBaseParser): Unit = {
+      parser.setInterpreter(
+        new ParserATNSimulator(parser, atn, decisionToDFA, predictionContextCache))
+    }
+  }
+
+  private object AntlrCaches {
+    private def makeDecisionToDFA(atn: ATN): Array[DFA] = {
+      val decisionToDFA = new Array[DFA](atn.getNumberOfDecisions)
+      for (i <- 0 until atn.getNumberOfDecisions) {
+        decisionToDFA(i) = new DFA(atn.getDecisionState(i), i)
+      }
+      decisionToDFA
+    }
+  }
+
+  private val parserCaches = new AtomicReference[AntlrCaches](AntlrCaches(SqlBaseParser._ATN))
+
+  /**
+   * Install the parser caches into the given parser.
+   *
+   * This method should be called before parsing any input.
+   */
+  def installCaches(parser: SqlBaseParser): Unit = parserCaches.get().installCaches(parser)
+
+  /**
+   * Drop the existing parser caches and create a new one.
+   *
+   * ANTLR retains caches in its parser that are never released.  This speeds up parsing of future
+   * input, but it can consume a lot of memory depending on the input seen so far.

Review Comment:
   When I say:
   
   > The `ParserATNSimulator` was constructed with a reference to the static DFA cache
   
   This is best seen in the ANTLR generated code.  Here's a snippet:
   
   ```java
   public class SqlBaseParser extends Parser {
           static { RuntimeMetaData.checkVersion("4.8", RuntimeMetaData.VERSION); }
   
           protected static final DFA[] _decisionToDFA;
           protected static final PredictionContextCache _sharedContextCache =
                   new PredictionContextCache();
   ...
           public SqlBaseParser(TokenStream input) {
                   super(input);
                   _interp = new ParserATNSimulator(this,_ATN,_decisionToDFA,_sharedContextCache);
           }
   ...
           static {
                   _decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
                   for (int i = 0; i < _ATN.getNumberOfDecisions(); i++) {
                           _decisionToDFA[i] = new DFA(_ATN.getDecisionState(i), i);
                   }
           }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1526471948


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/parsers.scala:
##########
@@ -105,11 +109,61 @@ abstract class AbstractParser extends DataTypeParserInterface with Logging {
           messageParameters = e.getMessageParameters.asScala.toMap,
           queryContext = e.getQueryContext)
     }
+    finally {
+      if (conf.releaseAntlrCacheAfterParsing) {
+        AbstractSqlParser.refreshParserCaches()
+      }
+    }
   }
 
   private def conf: SqlApiConf = SqlApiConf.get
 }
 
+object AbstractSqlParser {

Review Comment:
   I'm seeing two `sparkr` tests failing with:
   
   ```
   Error in `handleErrors(returnStatus, conn)`: java.lang.IncompatibleClassChangeError: class org.apache.spark.sql.execution.SparkSqlParser cannot inherit from final class org.apache.spark.sql.catalyst.parser.AbstractSqlParser
   ```
   
   https://github.com/markj-db/spark/actions/runs/8288610971/job/22688738738
   
   Is this because of the addition of this companion object?  Can I run the `sparkr` tests locally?  I don't see them mentioned on https://spark.apache.org/developer-tools.html, but I'll keep looking around.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525548225


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +
+        "the available heap on the driver, leading to high GC overhead and evenatually OOMs. Set " +
+        "this to `true` to release the cache after every SQL statement.  On a long-lived driver, " +

Review Comment:
   ditto. `.  On` -> `. On`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525547925


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +

Review Comment:
   nit. `.  This` -> `. This`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on PR #45526:
URL: https://github.com/apache/spark/pull/45526#issuecomment-1998596814

   > Instead of recommending this, why don't we do this from Spark driver side by itself? For example, instead of having a boolean configuration like this, we can have an integer configuration to clear up. The default value could be Int.MaxValue.
   
   Could you be more specific about what the integer means in your proposal?
   
   I had considered a few possibilities for making this simpler to work with:
    * try to detect memory pressure and clear the cache only when there's memory pressure; is there a callback we could hook into that would let us free the memory when the driver is memory-constrained?
    * let the user specify a maximum size for the cache: this requires being able to calculate the size of the objects cached, and I'm not clear how to achieve this ([memory-measurer](https://github.com/DimitrisAndreou/memory-measurer/tree/master?tab=readme-ov-file) is one option I came across while searching)
    * I suppose you could also clear the cache every N queries
    * Alternatively, you could have a background thread that clears the cache on a timer
    * You could just clear the cache on every parse; this allows reuse of the cache during the processing of a query, but not between queries.
   
   I left this manual because there's a tradeoff here where neither behaviour is clearly superior.  For example, if the driver will be handling many similar queries over its lifetime and none of them are terribly large, the performance benefit of the cache probably outweighs the reclaimed memory (e.g. imagine a cluster that's optimized for minimizing latency on small queries).  The current behaviour is probably appropriate for most Spark users most of the time, so leaving this as a manual tuning option seems like the right balance of complexity, effort, and impact.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45526:
URL: https://github.com/apache/spark/pull/45526#issuecomment-1998862359

   cc @linhongliu-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525576127


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")
+      .doc("When true, release the ANTLR cache after parsing a SQL query. ANTLR parsers retain a " +
+        "DFA cache designed to speed up parsing future input. However, there is no limit to how " +
+        "large this cache can become and parsing large SQL statements can lead to an " +
+        "accumulation of objects in the cache that are unlikely to be reused.  This can deplete " +

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525544763


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {

Review Comment:
   If you don't mind, please use the test prefix like the following.
   ```scala
   - test("release ANTLR cache after parsing") {
   + test("SPARK-47404: release ANTLR cache after parsing") {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525574297


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB
+    parser.parsePlan("select * from range(10)")
+    // Within some tolerance, the memory usage should be the same as it was after filling the ANTLR
+    // DFA cache.
+    val tolerance = 0.4
+    val memoryUsageWithoutRelease = getMemoryUsage
+    assert(memoryUsageWithoutRelease > baselineMemory + (1 - tolerance) * estimatedCacheOverhead)
+    assert(memoryUsageWithoutRelease < baselineMemory + (1 + tolerance) * estimatedCacheOverhead)
+    withSQLConf("spark.sql.releaseAntlrCacheAfterParse" -> "true") {
+      parser.parsePlan("select id from range(10)")

Review Comment:
   How would I measure the flakiness of this test?  I think I've made a reasonable attempt to mitigate this risk, but I'm happy to try verifying.  Under what circumstances do multiple tests run concurrently in the same JVM?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525575493


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB
+    parser.parsePlan("select * from range(10)")
+    // Within some tolerance, the memory usage should be the same as it was after filling the ANTLR
+    // DFA cache.
+    val tolerance = 0.4
+    val memoryUsageWithoutRelease = getMemoryUsage
+    assert(memoryUsageWithoutRelease > baselineMemory + (1 - tolerance) * estimatedCacheOverhead)
+    assert(memoryUsageWithoutRelease < baselineMemory + (1 + tolerance) * estimatedCacheOverhead)
+    withSQLConf("spark.sql.releaseAntlrCacheAfterParse" -> "true") {
+      parser.parsePlan("select id from range(10)")

Review Comment:
   Running in SBT, my experience was the testcases in the suite run serially.  I observed the new test to take about 2 seconds:
   
   ```
   SPARK-47404: release ANTLR cache after parsing (1 second, 855 milliseconds)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525572646


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB

Review Comment:
   I worry a little about flakiness too, but the thresholding is intended to prevent that from being an issue.
   
   Why especially Apple Silicon?
   
   What leftovers are you referring to?  The only thing I can think of is the ANTLR cache, and this test is explicitly verifying that the ANTLR cache is cleared.  So I think yes, all leftovers are cleared?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on PR #45526:
URL: https://github.com/apache/spark/pull/45526#issuecomment-2008072396

@dongjoon-hyun I can certainly implement your proposal if the consensus among reviewers is that we'd like to go that direction. The downside I can see in what you suggest is that it adds complexity (the user now has to pick `N` as well as decide to use this feature) and I think the benefit is dubious. When I try to imagine picking `N`, I can envision a few scenarios:
1. The cluster exists to run a workload known _a priori_; in this case, there may be an obvious point where the user of the cluster wants to clear the cache (e.g. after parsing a large query string; after parsing all query strings but before execution) and this is best served by explicitly clearing the cache once at the intended point. Picking some `N` to try to clear the cache at the right moment is brittle (what if refactoring adds a query) and a little opaque (since it involves setting the config some number of queries before the intended point to clear the cache).
2. The cluster will be long-lived and running an unknown workload; in this case, clearing the cache every `N` queries will unpredictably pessimize the experience of some queries without necessarily clearing the cache after the largest queries or when the driver memory demand is the greatest. In a sense, the larger `N` is, the more likely the cache will be cleared at the wrong time, making `N==1` optimal (in some sense -- it's kind of a stretch).
I think it will be difficult to pick a value for `N` that a user can set-and-forget.

The two strategies I favour are:
1. explicitly clear the cache when running a query that's known to either fill the cache or require maximal driver RAM (i.e. the strategy currently implemented)
2. automatically clear the cache in response to memory pressure (not implemented, but something that someone could pick up as future work to improve on what's in this PR)

I think adding an integer here increases the configuration space, but not in a way that I think is likely to benefit the end user.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "linhongliu-db (via GitHub)" <gi...@apache.org>.

linhongliu-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525536117


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")

Review Comment:
   nit: afterParsing?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB
+    parser.parsePlan("select * from range(10)")
+    // Within some tolerance, the memory usage should be the same as it was after filling the ANTLR
+    // DFA cache.
+    val tolerance = 0.4
+    val memoryUsageWithoutRelease = getMemoryUsage
+    assert(memoryUsageWithoutRelease > baselineMemory + (1 - tolerance) * estimatedCacheOverhead)
+    assert(memoryUsageWithoutRelease < baselineMemory + (1 + tolerance) * estimatedCacheOverhead)
+    withSQLConf("spark.sql.releaseAntlrCacheAfterParse" -> "true") {
+      parser.parsePlan("select id from range(10)")

Review Comment:
   how about run this query multiple times, so the result could be more accurate.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala:
##########
@@ -800,4 +800,39 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
         start = 0,
         stop = 63))
   }
+
+  def getMemoryUsage: Long = {
+    System.gc()
+    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
+  }
+
+  def awfulQuery(depth: Int): String = {
+    if (depth == 0) {
+      s"rand()"
+    } else {
+      s"case when ${awfulQuery(depth - 1)} > 0.5 " +
+      s"then ${awfulQuery(depth - 1)} " +
+      s"else ${awfulQuery(depth - 1)} " +
+      "end"
+    }
+  }
+
+  test("release ANTLR cache after parsing") {
+    val baselineMemory = getMemoryUsage // On my system, this is about 61 MiB
+    parser.parsePlan(s"select ${awfulQuery(8)} from range(10)")
+    val estimatedCacheOverhead = getMemoryUsage - baselineMemory // about  119 MiB
+    parser.parsePlan("select * from range(10)")
+    // Within some tolerance, the memory usage should be the same as it was after filling the ANTLR
+    // DFA cache.
+    val tolerance = 0.4
+    val memoryUsageWithoutRelease = getMemoryUsage
+    assert(memoryUsageWithoutRelease > baselineMemory + (1 - tolerance) * estimatedCacheOverhead)
+    assert(memoryUsageWithoutRelease < baselineMemory + (1 + tolerance) * estimatedCacheOverhead)
+    withSQLConf("spark.sql.releaseAntlrCacheAfterParse" -> "true") {
+      parser.parsePlan("select id from range(10)")

Review Comment:
   but even running repeatedly, I'm worried about the flakiness if the test runs with others in parallel.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525542770


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4589,6 +4589,20 @@ object SQLConf {
       .stringConf
       .createWithDefault("versionAsOf")
 
+  val RELEASE_ANTLR_CACHE_AFTER_PARSE =
+    buildConf("spark.sql.releaseAntlrCacheAfterParse")

Review Comment:
   Please use the existing config namespace, `spark.sql.parser`.
   ```
   - spark.sql.releaseAntlrCacheAfterParse
   + spark.sql.parser.releaseAntlrCacheAfterParse
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47404][SQL] Add hooks to release the ANTLR DFA cache after parsing SQL [spark]

Posted by "markj-db (via GitHub)" <gi...@apache.org>.

markj-db commented on code in PR #45526:
URL: https://github.com/apache/spark/pull/45526#discussion_r1525683035


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/parsers.scala:
##########
@@ -105,11 +109,61 @@ abstract class AbstractParser extends DataTypeParserInterface with Logging {
           messageParameters = e.getMessageParameters.asScala.toMap,
           queryContext = e.getQueryContext)
     }
+    finally {
+      if (conf.releaseAntlrCacheAfterParsing) {
+        AbstractSqlParser.refreshParserCaches()
+      }
+    }
   }
 
   private def conf: SqlApiConf = SqlApiConf.get
 }
 
+object AbstractSqlParser {
+  private case class AntlrCaches(atn: ATN) {
+    val predictionContextCache: PredictionContextCache = new PredictionContextCache
+    val decisionToDFA: Array[DFA] = AntlrCaches.makeDecisionToDFA(atn)
+    def installCaches(parser: SqlBaseParser): Unit = {
+      parser.setInterpreter(
+        new ParserATNSimulator(parser, atn, decisionToDFA, predictionContextCache))
+    }
+  }
+
+  private object AntlrCaches {
+    private def makeDecisionToDFA(atn: ATN): Array[DFA] = {
+      val decisionToDFA = new Array[DFA](atn.getNumberOfDecisions)
+      for (i <- 0 until atn.getNumberOfDecisions) {
+        decisionToDFA(i) = new DFA(atn.getDecisionState(i), i)
+      }
+      decisionToDFA
+    }
+  }
+
+  private val parserCaches = new AtomicReference[AntlrCaches](AntlrCaches(SqlBaseParser._ATN))
+
+  /**
+   * Install the parser caches into the given parser.
+   *
+   * This method should be called before parsing any input.
+   */
+  def installCaches(parser: SqlBaseParser): Unit = parserCaches.get().installCaches(parser)
+
+  /**
+   * Drop the existing parser caches and create a new one.
+   *
+   * ANTLR retains caches in its parser that are never released.  This speeds up parsing of future
+   * input, but it can consume a lot of memory depending on the input seen so far.

Review Comment:
   It's my understanding that it works for future tokens in the same query as well as for future queries.
   
   `ParserInterpreter.parse` calls `ParserInterpreter.visitState`, which calls `ParserInterpreter.visitDecisionState`, which calls `ParserATNSimulator.adaptivePredict`.  The `ParserATNSimulator` was constructed with a reference to the static DFA cache and `ParserATNSimulator.adaptivePredict` reads/updates that cache via `ParserATNSimulator.addDFAState` which calls `ATNConfigSet.optimizeConfigs`, which calls `ATNSimulator.getCachedContext`, which calls `PredictionContext.getCachedContext`, which calls `PredictionContextCache.add` and `PredictionContextCache.get`.
   
   https://github.com/antlr/antlr4/blob/69cfd8e49b911c4a0483035b100fd438f1a55841/runtime/Java/src/org/antlr/v4/runtime/atn/PredictionContext.java#L544-L605
   
   So, in short, any time there's a prediction during parsing, whether it's later in an input stream or in a subsequent input stream, it's using this cache.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org