Posted to dev@nutch.apache.org by "Eelco Lempsink (JIRA)" <ji...@apache.org> on 2006/10/27 09:17:16 UTC

[jira] Created: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Indexer doesn't handle null documents returned by filters
---------------------------------------------------------

                 Key: NUTCH-393
                 URL: http://issues.apache.org/jira/browse/NUTCH-393
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.8.1
            Reporter: Eelco Lempsink


Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer.  A trivial adjustment is all it takes:


@@ -237,6 +237,7 @@
       if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
       return;
     }
+    if (doc == null) return;
 
     float boost = 1.0f;
     // run scoring filters
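
For illustration, here is a minimal, self-contained sketch of what the guard changes. It is not the Nutch code: `NullGuardSketch`, `FilterChain`, and the in-memory list standing in for the index writer are all hypothetical, and a document is modelled as a plain `Map`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: FilterChain and the in-memory "writer" are hypothetical stand-ins, not Nutch classes. */
final class NullGuardSketch {

    /** Stand-in for the indexing filter chain; returning null means "drop this document". */
    interface FilterChain {
        Map<String, String> filter(Map<String, String> doc, String url) throws Exception;
    }

    private final FilterChain filters;
    private final List<Map<String, String>> written = new ArrayList<>();

    NullGuardSketch(FilterChain filters) {
        this.filters = filters;
    }

    /** Runs the filters on one document and stores it only if it survives. */
    void index(String url, Map<String, String> doc) {
        try {
            doc = filters.filter(doc, url);
        } catch (Exception e) {
            System.err.println("Error indexing " + url + ": " + e);
            return;
        }
        if (doc == null) return; // the guard from the patch: a filter dropped the document

        written.add(doc); // without the guard, a null would reach the writer here
    }

    int writtenCount() {
        return written.size();
    }
}
```

With a chain that returns null for some URLs, only the surviving documents end up in the list; the dropped ones are skipped silently.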


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494552 ] 

Andrzej Bialecki  commented on NUTCH-393:
-----------------------------------------

I agree with that - either all filters should run or the document should be discarded. If it's acceptable to tolerate exceptions in some indexing filters, such exceptions should be caught there.
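
One way to read "caught there" is that a filter which considers its own failure harmless handles the exception itself and hands the document back unchanged. A rough sketch under that assumption follows; `Filter`, `FilterException`, and `tolerant` are hypothetical names, not part of the Nutch API, and a document is modelled as a `String`.

```java
/** Sketch only: Filter and FilterException are hypothetical stand-ins, not Nutch classes. */
final class SelfTolerantFilterSketch {

    /** Stand-in for IndexingException. */
    static class FilterException extends Exception {
        FilterException(String message) {
            super(message);
        }
    }

    /** Stand-in for a single indexing filter plugin. */
    interface Filter {
        String filter(String doc) throws FilterException;
    }

    /**
     * Wraps an optional filter so that its failure is handled inside the filter:
     * the document is passed on unchanged instead of aborting the whole chain.
     */
    static Filter tolerant(Filter optional) {
        return doc -> {
            try {
                return optional.filter(doc);
            } catch (FilterException e) {
                System.err.println("Optional filter failed, keeping document: " + e.getMessage());
                return doc;
            }
        };
    }
}
```

The chain itself stays strict; only filters that opt in this way are allowed to fail without discarding the document.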

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-393.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki 

Both places (Indexer and IndexingFilters) fixed in rev. 536629, plus some javadoc clarification has been added. Thank you!



[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Eelco Lempsink (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-393?page=all ]

Eelco Lempsink updated NUTCH-393:
---------------------------------

    Attachment: NUTCH-393.patch

Here's a complete patch against the latest revision to fix this issue.  

Note that adjusting Indexer.java alone is not enough: the loop in IndexingFilters.java that executes each filter must also stop when doc == null.

This means that once a filter decides to drop the document, no other filter can undo that action.
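
As a rough sketch of that loop behaviour (not the attached patch itself; `FilterLoopSketch` and `Filter` are hypothetical stand-ins, and a document is modelled as a `String`):

```java
import java.util.List;

/** Sketch only: Filter is a hypothetical stand-in for an indexing filter plugin. */
final class FilterLoopSketch {

    /** Returns the (possibly modified) document, or null to drop it. */
    interface Filter {
        String filter(String doc);
    }

    /** Runs the filters in order and stops as soon as one of them returns null. */
    static String runAll(List<Filter> filters, String doc) {
        for (Filter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // dropped: the remaining filters never see the document
            }
        }
        return doc;
    }
}
```

Stopping early also protects later filters from being handed a null document they were never written to expect.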


        

[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Eelco Lempsink (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eelco Lempsink updated NUTCH-393:
---------------------------------

    Affects Version/s: 0.9.0



[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447787 ] 
            
Enis Soztutar commented on NUTCH-393:
-------------------------------------

Also, IndexingException is caught by the Indexer, in which case the whole document is not added to the writer (the function returns).

Indexer : 334
try {
    // run indexing filters
    doc = this.filters.filter(doc, parse, (UTF8)key, fetchDatum, inlinks);
} catch (IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
    return;
}

IndexingException should be caught in IndexingFilters.filter(), so that when one indexing plugin throws an IndexingException, the other plugins can still run.
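
A minimal sketch of that suggestion, assuming a simplified chain: `TolerantChainSketch`, `Filter`, and `FilterException` are hypothetical stand-ins for IndexingFilters, IndexingFilter, and IndexingException, and a document is modelled as a `String`.

```java
import java.util.List;

/** Sketch only: Filter and FilterException are hypothetical stand-ins, not Nutch classes. */
final class TolerantChainSketch {

    /** Stand-in for IndexingException. */
    static class FilterException extends Exception {
        FilterException(String message) {
            super(message);
        }
    }

    /** Stand-in for a single indexing filter plugin. */
    interface Filter {
        String filter(String doc) throws FilterException;
    }

    /**
     * Runs every filter; a filter that throws is skipped and its message recorded,
     * while the document continues through the remaining filters unchanged.
     */
    static String runAll(List<Filter> filters, String doc, List<String> errors) {
        for (Filter f : filters) {
            try {
                String next = f.filter(doc);
                if (next == null) {
                    return null; // an explicit drop still ends the chain
                }
                doc = next;
            } catch (FilterException e) {
                errors.add(e.getMessage()); // skip the failing filter, keep the document as it was
            }
        }
        return doc;
    }
}
```

The trade-off is that the caller receives a document that some filters never processed, so it has to check the recorded errors if that matters.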




        

[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

Posted by "Eelco Lempsink (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447939 ] 
            
Eelco Lempsink commented on NUTCH-393:
--------------------------------------

I'm not sure I agree with that. After running a document through a set of filters, you'd expect that all filters ran. If not, that's an exception. For instance, your index might depend on all numbers and non-English words being stripped. When one of those filters hits an exception but the other one runs, your index will become dirty.
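
To make the all-or-nothing behaviour concrete, here is a rough sketch. `StrictChainSketch`, `Filter`, `FilterException`, and the example filters are hypothetical, not Nutch classes, and a document is modelled as a `String`.

```java
import java.util.List;

/** Sketch only: Filter and FilterException are hypothetical stand-ins, not Nutch classes. */
final class StrictChainSketch {

    /** Stand-in for IndexingException. */
    static class FilterException extends Exception {
        FilterException(String message) {
            super(message);
        }
    }

    /** Stand-in for a single indexing filter plugin. */
    interface Filter {
        String filter(String text) throws FilterException;
    }

    /** All-or-nothing chain: any exception propagates to the caller. */
    static String runAll(List<Filter> filters, String text) throws FilterException {
        for (Filter f : filters) {
            text = f.filter(text);
            if (text == null) {
                return null; // explicitly dropped by a filter
            }
        }
        return text;
    }

    /** Caller side: returns the fully filtered text, or null when the document must be discarded. */
    static String filterOrDiscard(List<Filter> filters, String text) {
        try {
            return runAll(filters, text);
        } catch (FilterException e) {
            System.err.println("Discarding document: " + e.getMessage());
            return null;
        }
    }
}
```

With a number-stripping filter followed by a word filter that fails, the document is discarded rather than indexed with only the numbers removed.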
