Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/06/23 22:15:25 UTC

[jira] Created: (NUTCH-505) Outlink urls should be validated

Outlink urls should be validated
--------------------------------

                 Key: NUTCH-505
                 URL: https://issues.apache.org/jira/browse/NUTCH-505
             Project: Nutch
          Issue Type: Improvement
            Reporter: Doğacan Güney
            Priority: Minor


See discussion here:
http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html

Parse plugins may extract garbage urls from pages. We need a url validation system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512074 ] 

Doğacan Güney commented on NUTCH-505:
-------------------------------------

Thanks for the suggestion. Automaton really looks good, but using automaton in UrlValidator would mean bringing the automaton jar into nutch core (it currently resides in the lib directory of the urlfilter-automaton plugin). I am not sure if that's OK with everyone.



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505-v3.patch
                filtered.txt

New and final version. I shuffled some code around in ParseOutputFormat for better performance, and updated some regex patterns in UrlValidator.

I am also attaching a file showing which urls are filtered from a sample parse of 2000 urls.



[jira] Resolved: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-505.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Committed in rev. 555237.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512139 ] 

Andrzej Bialecki  commented on NUTCH-505:
-----------------------------------------

Please test on both Java 1.5 and Java 1.6 - IIRC there are some differences in the performance of java.util.regex between these two versions.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511447 ] 

Andrzej Bialecki  commented on NUTCH-505:
-----------------------------------------

* In ParseOutputFormat, the calculation of outlinksToStore should not make repeated calls to job.getInt() - the value of db.max.outlinks.per.page should be retrieved once per invocation of getRecordWriter().

* You should increase the version number of ParseData, and add code to read the current version of ParseData. Otherwise the updated code won't be able to read older segments.

Other than that, the patch looks great, +1 for committing it after fixing these issues.
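
The first point above can be illustrated with a minimal sketch. It is not the ParseOutputFormat code from the patch: java.util.Properties stands in for the job configuration, the Writer interface and the lookups counter are hypothetical, and the default of 100 and the "negative means no limit" rule are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class ReadOnceSketch {
    /** Stand-in for a record writer; not the Hadoop RecordWriter interface. */
    interface Writer {
        List<String> write(List<String> outlinks);
    }

    // Counts configuration lookups, for demonstration only.
    static int lookups = 0;

    // Hypothetical helper playing the role of job.getInt().
    static int getInt(Properties conf, String key, int defaultValue) {
        lookups++;
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    static Writer getRecordWriter(Properties conf) {
        // Looked up once per getRecordWriter() call, then captured by the writer.
        final int configured = getInt(conf, "db.max.outlinks.per.page", 100);
        // Assumption for this sketch: a negative value means "no limit".
        final int maxOutlinks = configured < 0 ? Integer.MAX_VALUE : configured;
        return outlinks -> new ArrayList<>(
            outlinks.subList(0, Math.min(maxOutlinks, outlinks.size())));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("db.max.outlinks.per.page", "2");
        Writer writer = getRecordWriter(conf);
        for (int i = 0; i < 3; i++) {
            System.out.println(writer.write(Arrays.asList("a", "b", "c")));
        }
        System.out.println("config lookups: " + lookups);
    }
}
```

However many records the writer handles, the configuration is consulted only when the writer is created.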



[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512201 ] 

Doğacan Güney commented on NUTCH-505:
-------------------------------------

Andrzej, in my tests, java.util.regex is faster on both Java 1.5 and Java 1.6.

And btw, I added ( and ) as valid path characters to the relevant regex pattern because nutch was able to fetch a url containing them.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512071 ] 

Espen Amble Kolstad commented on NUTCH-505:
-------------------------------------------

Automaton (http://www.brics.dk/automaton/), used in AutomatonURLFilter, is even faster if you preparse the regexes.
It doesn't support all regex syntax, but most of it.



[jira] Closed: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-505.
-------------------------------


Latest patch (for optimization) is committed in rev. 555969.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507803 ] 

Doğacan Güney commented on NUTCH-505:
-------------------------------------

btw, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com/</div>
http://www.variety.com/</div></a>
mailto:varietycomments@reedbusiness.com
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly improve scoring.
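
As a rough illustration of the kind of check that rejects the strings listed above, here is a minimal sketch using java.util.regex. The class name and the pattern are hypothetical and much simpler than a real validator; this is not the UrlValidator from the patch. The pattern is compiled once and reused, and it accepts ( and ) in paths.

```java
import java.util.regex.Pattern;

class OutlinkValidatorSketch {
    // Hypothetical pattern, compiled once and reused; NOT the pattern from the patch.
    // Path characters include ( and ) but exclude <, >, quotes, spaces and [ ].
    private static final Pattern HTTP_URL = Pattern.compile(
        "^https?://[A-Za-z0-9.-]+(:\\d+)?"
        + "(/[A-Za-z0-9._~!$&()*,;=:@%/-]*)?"
        + "(\\?[A-Za-z0-9._~!$&()*,;=:@%/?+-]*)?$");

    static boolean isValid(String url) {
        return url != null && HTTP_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.variety.com/",
            "http:/",
            "http://www.variety.com/</div>",
            "mailto:varietycomments@reedbusiness.com",
            "http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "\t" + s);
        }
    }
}
```

This sketch only accepts http and https, which is why the mailto: link is rejected; a real validator would make the set of allowed schemes configurable.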




[jira] Commented: (NUTCH-505) Outlink urls should be validated

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511985 ] 

Hudson commented on NUTCH-505:
------------------------------

Integrated in Nutch-Nightly #147 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/])



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505_draft.patch

Initial draft patch. 

* Uses UrlValidator class from apache commons validator.
* ParseOutputFormat first checks if an outlink is valid. If it is, then it runs normalizers and urlfilters on url.

This patch is tested very lightly, so it probably doesn't work great yet. Comments, reviews, suggestions are welcome.
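
The ordering described in the second point (validate first, then normalize and filter) could look roughly like the sketch below. The three steps are passed in as plain functional interfaces; these are hypothetical stand-ins, not the Nutch URLNormalizers or URLFilters classes, and treating a null normalizer result as "drop the url" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

class OutlinkPipelineSketch {
    static List<String> process(List<String> outlinks,
                                Predicate<String> validator,
                                UnaryOperator<String> normalizer,
                                Predicate<String> urlFilter) {
        // Sized up front so the list does not need to grow while filling.
        List<String> kept = new ArrayList<>(outlinks.size());
        for (String url : outlinks) {
            if (!validator.test(url)) {
                continue; // garbage is rejected before any further work is done
            }
            String normalized = normalizer.apply(url);
            if (normalized != null && urlFilter.test(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/a#top",
            "http://example.com/</div>",
            "http://example.com/pic.gif");
        // Toy stand-ins for the validator, a normalizer and a url filter.
        List<String> kept = process(
            outlinks,
            u -> u.startsWith("http://") && !u.contains("<"),
            u -> u.replaceAll("#.*$", ""),
            u -> !u.endsWith(".gif"));
        System.out.println(kept);
    }
}
```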



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505-v2.patch

After my last commit, I read that Sun's java.util.regex implementation is actually faster than jakarta-oro. So, I changed UrlValidator to use java.util.regex instead of jakarta-oro. I made some simple tests and java.util.regex really seems to be faster. I also added some basic optimizations to ParseOutputFormat (added initialCapacity arguments to ArrayLists to reduce the number of allocations).

Is it necessary to reopen this issue or open another issue for this? I think this one is simple enough to commit without opening a separate issue, but feel free to disagree.

Also, I realized that UrlValidator considers http://www.iiit.net/images/CCCCCC_line_br[1].gif invalid, even though firefox will display the gif (firefox escapes the path then fetches the escaped url). This doesn't seem to be a problem right now since nutch can't fetch these urls anyway, but we may consider adding some sort of smart escaping later.
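
One possible form of such escaping, shown only as a sketch and not as something the patch does: the multi-argument java.net.URI constructors percent-encode characters that are illegal in a path, such as [ and ]. The class and method names below are hypothetical. Note that these constructors also quote the % character, so a path that is already escaped would be escaped a second time.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathEscapeSketch {
    // Builds a url from its parts, letting java.net.URI quote illegal path characters.
    static String escapePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build url from parts", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            escapePath("http", "www.iiit.net", "/images/CCCCCC_line_br[1].gif"));
    }
}
```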



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New version of the patch. As Andrzej has pointed out, db.max.outlinks.per.page is read once per getRecordWriter now.

> * you should increase the version number of ParseData, and add a code to read the current version of ParseData.
> Otherwise the updated code won't be able to read older segments.

This patch doesn't change how parse data reads outlinks. Before this patch, parse data used to read db.max.outlinks.per.page outlinks then skip over (as in read the outlink then ignore it) the rest. After this patch, parse data reads all outlinks. So, I/O behaviour is the same.



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New patch. This is sort of a release candidate, if there are no objections, I think this patch can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the current implementation, when a parse data comes to ParseOutputFormat, it contains at most db.max.outlinks.per.page outlinks; after filtering, ParseOutputFormat outputs whatever remains.

For example, in a situation where ignoreExternalLinks is true and the first hundred links (assuming db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted, even if there are internal urls past the 100th outlink.

So, now parse data reads all outlinks, and ParseOutputFormat processes them and outputs at most db.max.outlinks.per.page outlinks (the resulting parse data also contains at most db.max.outlinks.per.page outlinks). I think this is a better approach but it may be a bit slower.
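
The difference between the two orderings can be shown with a small sketch. The class and method names are hypothetical and the filter is a plain Predicate, not the Nutch filtering code; max is assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class OutlinkLimitSketch {
    // Old ordering as described above: cut the list to max first, then filter.
    static List<String> truncateThenFilter(List<String> outlinks, int max,
                                           Predicate<String> keep) {
        List<String> result = new ArrayList<>();
        int limit = Math.min(max, outlinks.size());
        for (int i = 0; i < limit; i++) {
            if (keep.test(outlinks.get(i))) {
                result.add(outlinks.get(i));
            }
        }
        return result;
    }

    // New ordering: look at every outlink, stop once max have been accepted.
    static List<String> filterThenTruncate(List<String> outlinks, int max,
                                           Predicate<String> keep) {
        List<String> result = new ArrayList<>(Math.min(max, outlinks.size()));
        for (String url : outlinks) {
            if (result.size() >= max) {
                break;
            }
            if (keep.test(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://other.org/1", "http://other.org/2", "http://other.org/3",
            "http://example.com/a", "http://example.com/b");
        // Toy version of ignoreExternalLinks: keep only urls on example.com.
        Predicate<String> internalOnly = u -> u.startsWith("http://example.com/");
        System.out.println(truncateThenFilter(outlinks, 3, internalOnly));
        System.out.println(filterThenTruncate(outlinks, 3, internalOnly));
    }
}
```

With a limit of 3 and three external links at the front, the first method returns nothing while the second returns both internal urls. The second method also does more work per page, which matches the "may be a bit slower" remark.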

Besides this change, UrlValidator code is cleaned up and moved into org.apache.nutch.net package. Also, outlinks are not normalized in ParseOutputFormat since they are already normalized in Outlink.Outlink. There is no point in normalizing them twice.



[jira] Updated: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505_draft_v2.patch

Patch updated for latest trunk.



[jira] Issue Comment Edited: (NUTCH-505) Outlink urls should be validated

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511542 ] 

Doğacan Güney edited comment on NUTCH-505 at 7/10/07 11:28 PM:
---------------------------------------------------------------

New version of the patch. As Andrzej has pointed out, db.max.outlinks.per.page is read once per getRecordWriter now.

> * you should increase the version number of ParseData, and add a code to read the current version
> of ParseData. Otherwise the updated code won't be able to read older segments. 

This patch doesn't change how parse data reads outlinks. Before this patch, parse data used to read db.max.outlinks.per.page outlinks then skip over (as in read the outlink then ignore it) the rest. After this patch, parse data reads all outlinks. So, I/O behaviour is the same.


