Posted to dev@nutch.apache.org by "Alexis (JIRA)" <ji...@apache.org> on 2011/01/01 09:06:45 UTC

[jira] Created: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Content-Length limit, URL filter and few minor issues
-----------------------------------------------------

                 Key: NUTCH-950
                 URL: https://issues.apache.org/jira/browse/NUTCH-950
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.0
            Reporter: Alexis


1. crawl command (nutch1.patch)

The class was renamed to Crawler but the references to it were not updated.


2. URL filter (nutch2.patch)

This avoids an NPE on bogus URLs whose host does not have a suffix.
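
The patch itself is not reproduced in this thread. As a minimal sketch of the kind of guard described, assuming the failure comes from dereferencing a suffix lookup that returns null: the class SuffixSafeFilter, its methods, and the tiny KNOWN_SUFFIXES table are all hypothetical stand-ins, not the Nutch DomainURLFilter API.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; names and suffix table are assumptions.
final class SuffixSafeFilter {
    // Assumed stand-in for a real domain-suffix table.
    private static final Set<String> KNOWN_SUFFIXES = Set.of("com", "org", "net");

    /** Returns the known suffix of the host, or null when the host has none. */
    static String suffixOf(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // e.g. "localhost": no suffix at all
        }
        String candidate = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(candidate) ? candidate : null;
    }

    /** Returns the URL if it is accepted, or null if it is rejected. */
    static String filter(String url, Set<String> allowedSuffixes) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: reject instead of failing
        }
        if (host == null) {
            return null; // no parsable host
        }
        String suffix = suffixOf(host);
        if (suffix == null) {
            return null; // the guard: bogus host without a suffix
        }
        return allowedSuffixes.contains(suffix) ? url : null;
    }
}
```

With this guard, a URL such as http://localhost/page is rejected by returning null rather than triggering a NullPointerException further down.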


3. Content-Length limit (nutch3.patch)

This is related to NUTCH-899.
The patch prevents the entire flush operation on the Gora datastore from crashing when the MySQL blob limit is exceeded by a few bytes. Both the protocol-http and protocol-httpclient plugins were affected.
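
The contents of nutch3.patch are not shown in this thread. A minimal sketch of the general idea, assuming the fix amounts to never storing more than a configured number of bytes per page: the class ContentLimiter and the method readLimited are hypothetical names, not part of either protocol plugin.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the code from nutch3.patch.
final class ContentLimiter {
    /**
     * Reads at most maxBytes from the stream and returns them.
     * A negative maxBytes means "no limit".
     */
    static byte[] readLimited(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (maxBytes < 0 || out.size() < maxBytes) {
            // Never request more than the remaining budget, so the result
            // cannot exceed the limit even by a few bytes.
            int want = maxBytes < 0
                    ? buf.length
                    : Math.min(buf.length, maxBytes - out.size());
            int n = in.read(buf, 0, want);
            if (n < 0) {
                break; // end of stream
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The point of capping each read at the remaining budget, rather than checking the total after a full buffer has been appended, is that the stored content stays at or below the limit exactly.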


4. Ivy configuration (nutch4.patch)
- Change the xercesImpl and restlet versions. Both changes are required: the current xercesImpl version makes a JUnit test crash, and the current restlet version is missing from the default Maven repository.

- Add gora-hbase and zookeeper (an HBase dependency). Add the MySQL connector. These jars are necessary to run Gora with the HBase or MySQL datastores. (More a suggestion than a requirement here.)

- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978832#action_12978832 ] 

Julien Nioche commented on NUTCH-950:
-------------------------------------

Have committed the first 3 sub-issues.

Regarding the last one, I haven't tested the first point (version changes), but here are a few comments about the other points:
* HBase + MySQL: these backends should not be provided by default; the same goes for the MySQL connector. One option would be to add them to the ivy file but comment them out, with a short explanation, e.g. "uncomment this if you want to use xxx as a GORA backend"
* the dependency com.jcraft/jsch should be placed in the ivy file of the corresponding plugin, not in the main one

Alexis, could you please create a new issue for this and then mark this issue as resolved? Having a single JIRA number for completely separate issues is a bad idea and does not help keep things in sync with the svn commits.

Thanks a lot for your contributions

Julien



[jira] Resolved: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis resolved NUTCH-950.
--------------------------

       Resolution: Fixed
    Fix Version/s: 2.0

Sorry, I missed the Ivy configuration file in the plugin directory.

See NUTCH-955 for the new Ivy issue.


[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-950:
-------------------------

    Attachment: nutch4.patch


[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976421#action_12976421 ] 

Julien Nioche commented on NUTCH-950:
-------------------------------------

Will look into this next week, thanks for your contribution. In the future, please open separate JIRA issues instead of putting everything into a single one.


[jira] Issue Comment Edited: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934 ] 

Julien Nioche edited comment on NUTCH-950 at 1/5/11 2:56 PM:
-------------------------------------------------------------

Committed revision 1055604 in 1.3
Committed revision 1055608 for trunk

 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)

will review the other submissions later



      was (Author: jnioche):
    Committed 1055604 in 1.3

 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)

will commit for 2.0 later and review the other submissions


  

[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977934#action_12977934 ] 

Julien Nioche commented on NUTCH-950:
-------------------------------------

Committed 1055604 in 1.3

 NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)

will commit for 2.0 later and review the other submissions




[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-950:
-------------------------

    Attachment: nutch3.patch
                nutch2.patch
                nutch1.patch
