You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/09/22 20:29:07 UTC

[GitHub] [solr] gerlowskija opened a new pull request, #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

gerlowskija opened a new pull request, #1040:
URL: https://github.com/apache/solr/pull/1040

   # Description
   
   IgnoreLargeDocumentProcessorFactory only has a single way to handle documents that exceed its configurable size limit. The first violation throws a SolrException: in effect, short-circuiting any remaining documents in the "batch" and returning a 400 to the user.
   
   This is great for end users whose clients are built to handle the resulting 400 response, and who can modify and resubmit the batch. But it's not ideal for every use-case, especially where "best-effort" indexing is good enough.
   
   # Solution
   
   This PR includes adding a new "permissive" mode of handling too-large documents to ILDPF. Under this new mode "too-large" documents will be logged (and not indexed), but won't cause the entire batch to be aborted/error-out.
   
   # Tests
   
   Tests and documentations are still needed.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] gerlowskija commented on a diff in pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
gerlowskija commented on code in PR #1040:
URL: https://github.com/apache/solr/pull/1040#discussion_r980316083


##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -59,15 +71,25 @@ public void init(NamedList<?> args) {
   public UpdateRequestProcessor getInstance(
       SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
     return new UpdateRequestProcessor(next) {
+
       @Override
       public void processAdd(AddUpdateCommand cmd) throws IOException {
         long docSize = ObjectSizeEstimator.estimate(cmd.getSolrInputDocument());
         if (docSize / 1024 > maxDocumentSize) {
+          handleViolatingDoc(cmd, docSize);
+        } else {
+          super.processAdd(cmd);
+        }
+      }
+
+      private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
+        if (onlyLogErrors) {
+          log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);

Review Comment:
   Good call; I've updated the var names and log message to make this clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] sonatype-lift[bot] commented on a diff in pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on code in PR #1040:
URL: https://github.com/apache/solr/pull/1040#discussion_r978094487


##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -41,14 +46,21 @@
  */
 public class IgnoreLargeDocumentProcessorFactory extends UpdateRequestProcessorFactory {
   public static final String LIMIT_SIZE_PARAM = "limit";
+  public static final String ONLY_LOG_ERRORS_PARAM = "onlyLogErrors";
+  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
 
   // limit of a SolrInputDocument size (in kb)
   private long maxDocumentSize = 1024 * 1024;
+  private boolean onlyLogErrors = false;
 
   @Override
   public void init(NamedList<?> args) {
-    maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM);
+    final SolrParams params = args.toSolrParams();
+    maxDocumentSize = params.required().getLong(LIMIT_SIZE_PARAM);

Review Comment:
   *NULL_DEREFERENCE:*  object returned by `params.required().getLong("limit")` could be null and is dereferenced at line 60.
   
   ---
   
   <details><summary><b>ℹī¸ Learn about @sonatype-lift commands</b></summary>
   
   You can reply with the following commands. For example, reply with ***@sonatype-lift ignoreall*** to leave out all findings.
   | **Command** | **Usage** |
   | ------------- | ------------- |
   | `@sonatype-lift ignore` | Leave out the above finding from this PR |
   | `@sonatype-lift ignoreall` | Leave out all the existing findings from this PR |
   | `@sonatype-lift exclude <file\|issue\|path\|tool>` | Exclude specified `file\|issue\|path\|tool` from Lift findings by updating your config.toml file |
   
   **Note:** When talking to LiftBot, you need to **refresh** the page to see its response.
   <sub>[Click here](https://github.com/apps/sonatype-lift/installations/new) to add LiftBot to another repo.</sub></details>
   
   
   
   ---
   
   Was this a good recommendation?
   [ [🙁 Not relevant](https://www.sonatype.com/lift-comment-rating?comment=336441442&lift_comment_rating=1) ] - [ [😕 Won't fix](https://www.sonatype.com/lift-comment-rating?comment=336441442&lift_comment_rating=2) ] - [ [😑 Not critical, will fix](https://www.sonatype.com/lift-comment-rating?comment=336441442&lift_comment_rating=3) ] - [ [🙂 Critical, will fix](https://www.sonatype.com/lift-comment-rating?comment=336441442&lift_comment_rating=4) ] - [ [😊 Critical, fixing now](https://www.sonatype.com/lift-comment-rating?comment=336441442&lift_comment_rating=5) ]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] sonatype-lift[bot] commented on pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on PR #1040:
URL: https://github.com/apache/solr/pull/1040#issuecomment-1258508777

   :warning: **313 God Classes** were detected by Lift in this project. [Visit the Lift web console](https://lift.sonatype.com/results/github.com/apache/solr/01GDXGHJK4V24WPWQJ151NBM9N?tab=technical-debt&utm_source=github.com&utm_campaign=lift-comment&utm_content=apache\%20solr) for more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] sonatype-lift[bot] commented on a diff in pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on code in PR #1040:
URL: https://github.com/apache/solr/pull/1040#discussion_r980415100


##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -41,14 +45,20 @@
  */
 public class IgnoreLargeDocumentProcessorFactory extends UpdateRequestProcessorFactory {
   public static final String LIMIT_SIZE_PARAM = "limit";
+  public static final String PERMISSIVE_MODE_PARAM = "permissiveMode";
+  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   // limit of a SolrInputDocument size (in kb)
-  private long maxDocumentSize = 1024 * 1024;
+  private long maxDocumentSizeKb = 1024 * 1024;
+  private boolean onlyLogErrors = false;
 
   @Override
   public void init(NamedList<?> args) {
-    maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM);
+    final SolrParams params = args.toSolrParams();
+    maxDocumentSizeKb = params.required().getLong(LIMIT_SIZE_PARAM);

Review Comment:
   *NULL_DEREFERENCE:*  object returned by `params.required().getLong("limit")` could be null and is dereferenced at line 58.
   
   ---
   
   <details><summary><b>ℹī¸ Learn about @sonatype-lift commands</b></summary>
   
   You can reply with the following commands. For example, reply with ***@sonatype-lift ignoreall*** to leave out all findings.
   | **Command** | **Usage** |
   | ------------- | ------------- |
   | `@sonatype-lift ignore` | Leave out the above finding from this PR |
   | `@sonatype-lift ignoreall` | Leave out all the existing findings from this PR |
   | `@sonatype-lift exclude <file\|issue\|path\|tool>` | Exclude specified `file\|issue\|path\|tool` from Lift findings by updating your config.toml file |
   
   **Note:** When talking to LiftBot, you need to **refresh** the page to see its response.
   <sub>[Click here](https://github.com/apps/sonatype-lift/installations/new) to add LiftBot to another repo.</sub></details>
   
   
   
   ---
   
   Was this a good recommendation?
   [ [🙁 Not relevant](https://www.sonatype.com/lift-comment-rating?comment=338019884&lift_comment_rating=1) ] - [ [😕 Won't fix](https://www.sonatype.com/lift-comment-rating?comment=338019884&lift_comment_rating=2) ] - [ [😑 Not critical, will fix](https://www.sonatype.com/lift-comment-rating?comment=338019884&lift_comment_rating=3) ] - [ [🙂 Critical, will fix](https://www.sonatype.com/lift-comment-rating?comment=338019884&lift_comment_rating=4) ] - [ [😊 Critical, fixing now](https://www.sonatype.com/lift-comment-rating?comment=338019884&lift_comment_rating=5) ]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] gerlowskija merged pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
gerlowskija merged PR #1040:
URL: https://github.com/apache/solr/pull/1040


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on a diff in pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
dsmiley commented on code in PR #1040:
URL: https://github.com/apache/solr/pull/1040#discussion_r979055239


##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -59,15 +71,25 @@ public void init(NamedList<?> args) {
   public UpdateRequestProcessor getInstance(
       SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
     return new UpdateRequestProcessor(next) {
+
       @Override
       public void processAdd(AddUpdateCommand cmd) throws IOException {
         long docSize = ObjectSizeEstimator.estimate(cmd.getSolrInputDocument());
         if (docSize / 1024 > maxDocumentSize) {
+          handleViolatingDoc(cmd, docSize);
+        } else {
+          super.processAdd(cmd);
+        }
+      }
+
+      private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
+        if (onlyLogErrors) {
+          log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);

Review Comment:
   Are both numbers logged in the same units?  As a reader of this log message; I'm not sure what the unit is.



##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -59,15 +71,25 @@ public void init(NamedList<?> args) {
   public UpdateRequestProcessor getInstance(
       SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
     return new UpdateRequestProcessor(next) {
+
       @Override
       public void processAdd(AddUpdateCommand cmd) throws IOException {
         long docSize = ObjectSizeEstimator.estimate(cmd.getSolrInputDocument());
         if (docSize / 1024 > maxDocumentSize) {
+          handleViolatingDoc(cmd, docSize);
+        } else {
+          super.processAdd(cmd);
+        }
+      }
+
+      private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
+        if (onlyLogErrors) {
+          log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);
+        } else {
           throw new SolrException(
-              BAD_REQUEST,
-              "Size of the document " + cmd.getPrintableId() + " is too large, around:" + docSize);
+                  BAD_REQUEST,
+                  "Size of the document " + cmd.getPrintableId() + " is too large, around:" + estimatedSizeBytes);

Review Comment:
   If necessary, convert to the same units as that we'd log?



##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -41,14 +46,21 @@
  */
 public class IgnoreLargeDocumentProcessorFactory extends UpdateRequestProcessorFactory {
   public static final String LIMIT_SIZE_PARAM = "limit";
+  public static final String ONLY_LOG_ERRORS_PARAM = "onlyLogErrors";
+  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
 
   // limit of a SolrInputDocument size (in kb)
   private long maxDocumentSize = 1024 * 1024;
+  private boolean onlyLogErrors = false;
 
   @Override
   public void init(NamedList<?> args) {
-    maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM);
+    final SolrParams params = args.toSolrParams();
+    maxDocumentSize = params.required().getLong(LIMIT_SIZE_PARAM);

Review Comment:
   Can't be null because it doesn't notice the impact of `required()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] gerlowskija commented on a diff in pull request #1040: SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory

Posted by GitBox <gi...@apache.org>.
gerlowskija commented on code in PR #1040:
URL: https://github.com/apache/solr/pull/1040#discussion_r980343005


##########
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java:
##########
@@ -59,15 +71,25 @@ public void init(NamedList<?> args) {
   public UpdateRequestProcessor getInstance(
       SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
     return new UpdateRequestProcessor(next) {
+
       @Override
       public void processAdd(AddUpdateCommand cmd) throws IOException {
         long docSize = ObjectSizeEstimator.estimate(cmd.getSolrInputDocument());
         if (docSize / 1024 > maxDocumentSize) {
+          handleViolatingDoc(cmd, docSize);
+        } else {
+          super.processAdd(cmd);
+        }
+      }
+
+      private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
+        if (onlyLogErrors) {
+          log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);

Review Comment:
   I actually ended up keeping the estimated size in bytes here, with the "limit" value still in KB.
   
   "Limit" makes sense to keep in kb, because that's the unit that users specify when configuring this URP.  As for the estimated size, I thought about using "kb" for that as well, but I was worried that users might be confused in cases where the integer-division/rounding needed to convert bytes to kb might result in displaying a message where the estimated size and limit are equal.
   
   In short I didn't want to have us printing log messages that looked like:
   
   > Skipping doc asdf bc size 2kb exceeds limit 2kb
   
   But I've added explicit units to the variables and log-messages involved here, so hopefully that's good enough to address the core of your issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org