Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/03/06 01:43:57 UTC

[jira] [Created] (NUTCH-1300) Indexer to normalize URL's

Indexer to normalize URL's
--------------------------

                 Key: NUTCH-1300
                 URL: https://issues.apache.org/jira/browse/NUTCH-1300
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.5


Indexers should be able to normalize URLs. This is useful when a new normalizer has been applied to the entire CrawlDB. Without normalization at indexing time, some or all records in a segment cannot be indexed at all.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1300.
----------------------------------

    Resolution: Fixed

Committed for 1.6 in rev. 1349262.

The -filter and -normalize options are now available and a new scope SCOPE_NORMALIZE was added. Thanks Sebastian and Lewis.
                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239418#comment-13239418 ] 

Markus Jelsma commented on NUTCH-1300:
--------------------------------------

I think a scope "index" makes sense. It would make building a two-way normalizer a bit easier. Command-line options can be added, but you can use the -D option as well.
                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266624#comment-13266624 ] 

Lewis John McGibbney commented on NUTCH-1300:
---------------------------------------------

Hi Markus. Before commenting on NUTCH-1323, I would also agree with Sebastian w.r.t. command-line options. As with NUTCH-1139, there was a clear-cut decision made to support command-line options, so this patch would also need them added to work correctly? Additionally, we know that many people find this convenient and comprehensive, especially if they are documented well on the wiki :0) Apart from this I'm also +1
                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293544#comment-13293544 ] 

Hudson commented on NUTCH-1300:
-------------------------------

Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/])
    NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

     Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java

                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225174#comment-13225174 ] 

Sebastian Nagel commented on NUTCH-1300:
----------------------------------------

+1
* effective fix for a serious problem: long running continuous crawls require adjustments of the normalization rules quite often
* tested (with 1.4): costs (time spent for extra normalization) are ok compared to the benefit

Two suggestions:
# Does a URLNormalizer scope "index" make sense? E.g., if only outlinks are normalized and default rules are empty, the scope "index" may use the same rules as scope "outlink".
# Wouldn't commandline options for solrindex be nice? Most other tools (generate, updatedb, invertlinks) have options such as -filter / -norm / -noNorm.
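
For illustration only (this is not taken from the patch and is not Nutch's URLNormalizers API): a minimal Java sketch of the first suggestion, in which a scope with no rules of its own reuses the rules of another scope. The class ScopedNormalizerSketch, its methods addRule, reuseRules and normalize, and the fragment-stripping rule are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; this is NOT Nutch's URLNormalizers class.
public class ScopedNormalizerSketch {
    // Normalization rules registered per scope name (e.g. "outlink").
    private final Map<String, List<UnaryOperator<String>>> rulesByScope = new HashMap<>();
    // Scopes that borrow the rules of another scope when they have none.
    private final Map<String, String> aliases = new HashMap<>();

    public void addRule(String scope, UnaryOperator<String> rule) {
        rulesByScope.computeIfAbsent(scope, k -> new ArrayList<>()).add(rule);
    }

    // Declare that 'scope' should reuse the rules of 'sourceScope'
    // whenever 'scope' has no rules of its own.
    public void reuseRules(String scope, String sourceScope) {
        aliases.put(scope, sourceScope);
    }

    public String normalize(String url, String scope) {
        List<UnaryOperator<String>> rules = rulesByScope.get(scope);
        if (rules == null && aliases.containsKey(scope)) {
            rules = rulesByScope.get(aliases.get(scope));
        }
        if (rules == null) {
            return url; // no rules for this scope: leave the URL unchanged
        }
        String result = url;
        for (UnaryOperator<String> rule : rules) {
            result = rule.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        ScopedNormalizerSketch n = new ScopedNormalizerSketch();
        // Example rule (an assumption for this sketch): strip the URL fragment.
        n.addRule("outlink", u -> u.replaceAll("#.*$", ""));
        // The "index" scope has no rules, so it borrows those of "outlink".
        n.reuseRules("index", "outlink");
        System.out.println(n.normalize("http://example.com/page#top", "index"));
    }
}
```

With this arrangement the same rule set is applied when outlinks are normalized and again at indexing time, so URLs in older segments end up in the same form as the current CrawlDB keys.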
                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266638#comment-13266638 ] 

Markus Jelsma commented on NUTCH-1300:
--------------------------------------

Sure! I'll add a command line option and update the tool description on the wiki. Will upload the patch and commit when trunk is 1.6.
                

        

[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1300:
---------------------------------

    Attachment: NUTCH-1300-1.5-1.patch

Patch for 1.5.
                

        

[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295798#comment-13295798 ] 

Hudson commented on NUTCH-1300:
-------------------------------

Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/])
    NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

     Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349262
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java

                

        

[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1300:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

        

[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1300:
---------------------------------

    Patch Info: Patch Available
    