Posted to dev@nutch.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/02/06 14:35:05 UTC

[jira] Created: (NUTCH-439) Top Level Domains Indexing / Scoring

Top Level Domains Indexing / Scoring
------------------------------------

                 Key: NUTCH-439
                 URL: https://issues.apache.org/jira/browse/NUTCH-439
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar


Top-level domains (TLDs) are the last part(s) of a host name in the DNS. TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups: infrastructure, generic (such as "com", "edu"), and country-code TLDs (such as "de", "tr"). Indexing the top-level domain, and optionally boosting by it, is needed to improve search results and enhance locality.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.0.patch

I have made major improvements to the code and configuration files. The change is no longer just a plugin: it now consists of a package, one big XML file, and an indexing/scoring plugin (which is disabled by default). The list of recognized suffixes is no longer limited to top-level domains; second- or third-level public domain suffixes can also be recognized. The patch also changes the naming from "top level domains" to "domain suffixes".

This patch also introduces the URLUtil class, which includes methods for getting the domain name or the public domain suffix of a URL. Finding the domain name of a URL is quite important for several reasons. First, we can use this function as a replacement for URL.getHost() in LinkDB for ignoring internal links, or in similar contexts. Second, we can perform statistical analysis on domain names. Third, we can list the subdomains under a domain, etc.

I have changed build.encoding to UTF-8 so that non-ASCII characters are recognized.

Here is an excerpt from the domain-suffixes.xml file:
	This document contains top level domains
	as described by the Internet Assigned Numbers
	Authority (IANA), and second or third level domains that
	are known to be managed by domain registrars. People in the
	Mozilla community call these public suffixes or effective
	TLDs. There is no algorithmic way of knowing whether a suffix
	is a public domain suffix or not, so this large file is used
	for this purpose. The entries in the file are used to find the
	domain of a URL, which may not be the same thing as the host of
	the URL. For example, for "http://lucene.apache.org/nutch" the
	hostname is lucene.apache.org, but the domain name for this
	URL would be apache.org. Domain names can be quite handy for
	statistical analysis and for fighting spam.
	
	The list of TLDs is constructed from IANA, and the
	list of "effective TLDs" is constructed from Wikipedia,
	http://wiki.mozilla.org/TLD_List, and http://publicsuffix.org/.
	The list may not include all the suffixes, but some
	effort has been spent to make it comprehensive. Please forward
	any improvements for this list to the nutch-dev mailing list or
	the Nutch JIRA.
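
To illustrate the lookup the excerpt describes, here is a minimal sketch in Java. It is not the code from the patch: the class name, the method names, and the six-entry suffix set are assumptions made for this example, and the real domain-suffixes.xml is far larger. The idea is to find the longest known suffix of the host and then keep one more label to its left.

```java
import java.net.URI;
import java.util.Set;

/** Sketch only: SUFFIXES is a tiny hypothetical stand-in for domain-suffixes.xml. */
final class DomainNameSketch {

    // Hypothetical sample of public suffixes, chosen for this example.
    static final Set<String> SUFFIXES = Set.of("org", "com", "uk", "co.uk", "tr", "com.tr");

    /** Returns the longest known public suffix of the host, or null if none matches. */
    static String getDomainSuffix(String host) {
        String candidate = host.toLowerCase();
        while (true) {
            if (SUFFIXES.contains(candidate)) {
                return candidate; // first hit is the longest, since we shrink from the left
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1); // drop the leftmost label and retry
        }
    }

    /** Returns the public suffix plus the one label to its left, e.g. apache.org. */
    static String getDomainName(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        host = host.toLowerCase();
        String suffix = getDomainSuffix(host);
        if (suffix == null || suffix.length() == host.length()) {
            return host; // unknown suffix, or the host is itself a suffix
        }
        String prefix = host.substring(0, host.length() - suffix.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(getDomainName("http://lucene.apache.org/nutch")); // apache.org
        System.out.println(getDomainName("http://www.example.co.uk/"));      // example.co.uk
    }
}
```

With "co.uk" in the set, www.example.co.uk maps to example.co.uk rather than co.uk, which is why a plain "last two labels" rule is not enough and a suffix list is needed.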




> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.3.patch

bq. TLDScoringFilter contains a misspelled field, tldEnties, it should be renamed to tldEntries
Done!
bq. one of the use cases for the "tld" index field that you mention is that users may search on it. But in the latest patch this field is added with Field.Index.NO, which makes searching on it impossible. Also, in order to search on arbitrary Lucene fields Nutch needs a Query filter, so we would need a TLDQueryFilter, which doesn't exist (yet?). 

Well, in fact NUTCH-445 covers searching on TLDs: we would be able to search site:lucene.apache.org, site:apache.org, or even site:org. Therefore I think indexing tld fields and a TLDQueryFilter are not needed. I will delve deeper into NUTCH-445 as soon as I find some time. We can move the domain indexing functionality to index-basic so that it will be generic enough.
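
As a rough illustration of the site: semantics mentioned above, the following Java sketch treats a site: term as a label-aligned suffix of the host. This is only an assumption about the intended behaviour; it is not taken from NUTCH-445, and the class and method names are made up for this example.

```java
/** Sketch only: one possible reading of site: matching, not the NUTCH-445 implementation. */
final class SiteMatchSketch {

    /** True if host equals site, or host ends with "." + site (so whole labels must match). */
    static boolean matchesSite(String host, String site) {
        String h = host.toLowerCase();
        String s = site.toLowerCase();
        return h.equals(s) || h.endsWith("." + s);
    }

    public static void main(String[] args) {
        System.out.println(matchesSite("lucene.apache.org", "lucene.apache.org")); // true
        System.out.println(matchesSite("lucene.apache.org", "apache.org"));        // true
        System.out.println(matchesSite("lucene.apache.org", "org"));               // true
        System.out.println(matchesSite("notapache.org", "apache.org"));            // false
    }
}
```

Requiring the leading dot keeps notapache.org from matching site:apache.org.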

bq. using domain names instead of host names - we need to discuss this further, let's create a separate issue on this. 
We can open issues case by case, since the patches are expected to have major side effects.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515812 ] 

Andrzej Bialecki  commented on NUTCH-439:
-----------------------------------------

Some minor issues:

* TLDScoringFilter contains a misspelled field, tldEnties, it should be renamed to tldEntries. Functionally it's of course the same, it's just a puzzling name that is easy to misspell (ie. spell correctly ;) ).

* one of the use cases for the "tld" index field that you mention is that users may search on it. But in the latest patch this field is added with Field.Index.NO, which makes searching on it impossible. Also, in order to search on arbitrary Lucene fields Nutch needs a Query filter, so we would need a TLDQueryFilter, which doesn't exist (yet?).

Other than that, +1 from me.

Re: using domain names instead of host names - we need to discuss this further, let's create a separate issue on this.





> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511362 ] 

Andrzej Bialecki  commented on NUTCH-439:
-----------------------------------------

Very nice patch! A couple comments:

* the fix to OPICScoringFilter - I will make this as a separate commit (no need to create a separate patch).

* IP_PATTERN  - it could be tighter, instead of \\d+ it could use \\d{1,3}

* the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that it's a common request for enhancement, but specific requirements vary wildly. Some users prefer to build a separate DB that holds statistical info and can be used in various steps of the work cycle, others still prefer one-time tools such as this one.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521033 ] 

Enis Soztutar commented on NUTCH-439:
-------------------------------------

Recently Matt Cutts wrote about the parts of a URL:
http://www.mattcutts.com/blog/seo-glossary-url-definitions/

It seems that, as expected, Google deals with the different parts of URLs. *smile*

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v1.1.patch

I accidentally forgot to unset http.agent.name in v1.0. This version is the same, except that the agent name is not set. This patch obsoletes v1.0.


> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch


[jira] Closed: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-439.
-------------------------------


Resolved and committed.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 1.0.0
>
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment:     (was: domain.suffixes_v2.1.patch)

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v1.0.patch

This is a plugin implementation for indexing and scoring top-level domains in Nutch. TLDs are stored in the TLDEntry class, which has the fields domain, status, and boost. The TLDs are read from an XML file. There is also an XSD for validation.

TLDIndexingFilter implements the IndexingFilter interface to index the domain extensions (such as "net", "org", "de") in the tld field.

TLDScoringFilter implements the ScoringFilter interface. Basically, this filter multiplies the initial boost (coming from another scoring filter such as opic) by the boost of the domain. This way, by configuring the boost of, say, "edu" domains to 1.1, the boosts of documents from educational sites are multiplied by 1.1 in the index. Local search engines may also wish to boost the domains hosted in their country. For example, boosting "de" domains a little in a German search engine seems reasonable. An alternative usage may be to lower the boosts of domains such as biz or info, which are known to have lots of spam.
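
The multiplication described above can be sketched as follows. This is not the TLDScoringFilter code: the class, the method names, and the boost values in the table are assumptions for illustration (only the 1.1 for "edu" comes from the text), and the patch reads its boosts from the XML file rather than from a hard-coded map.

```java
import java.util.Map;

/** Sketch only: hypothetical boost table and helper names, not the plugin's code. */
final class TldBoostSketch {

    // Hypothetical per-TLD boosts; unknown TLDs fall back to a neutral 1.0.
    static final Map<String, Float> BOOSTS = Map.of("edu", 1.1f, "de", 1.05f, "biz", 0.9f);

    /** Returns the text after the last dot of the host, lower-cased. */
    static String tldOf(String host) {
        return host.substring(host.lastIndexOf('.') + 1).toLowerCase();
    }

    /** Multiplies the boost produced by an earlier scoring filter by the TLD's boost. */
    static float boost(String host, float initialBoost) {
        return initialBoost * BOOSTS.getOrDefault(tldOf(host), 1.0f);
    }

    public static void main(String[] args) {
        System.out.println(boost("www.mit.edu", 2.0f));  // raised by the "edu" boost
        System.out.println(boost("example.com", 2.0f));  // unchanged, no entry for "com"
    }
}
```

Because the TLD boost is a multiplier on the incoming value, the filter can sit after opic (or any other scoring filter) in the chain without replacing its result.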

The users can also query the tld field for advanced search. 

Implementation notes:
1. OpicScoringFilter is changed to respect ScoringFilter chaining.
2. Some second-level domains, such as co.uk, are not recognized, but edu.uk is recognized.
                                        



> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch


[jira] Assigned: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney reassigned NUTCH-439:
-----------------------------------

    Assignee: Enis Soztutar

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515650 ] 

Doğacan Güney commented on NUTCH-439:
-------------------------------------

If there are no objections, I am going to commit this one. 

This is a big change, but it is almost completely self contained (besides the tld plugin which is disabled by default), so there should be no harm in committing it. Later, we can discuss whether it is useful to replace URL.getHost with getDomainName on a case-by-case basis. 

(FWIW, I think scoring-opic and linkdb should use domain name instead of host.)

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.2.patch

This patch includes "core" domain utilities and the tld plugin, but excludes the changes in NUTCH-517 and NUTCH-518. 

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515987 ] 

Enis Soztutar commented on NUTCH-439:
-------------------------------------

By the way, Andrzej, could you please enable support for wiki-style editing in the Nutch JIRA, similar to Hadoop's? Specifically, I find {code} blocks and bq. quite useful.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: domain.suffixes_v2.1.patch

> Very nice patch! 
Thanks !
> IP_PATTERN - it could be tighter, instead of \\d+ it could use \\d{1,3}
now it is (\\d{1,3}\\.){3}(\\d{1,3})
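
For reference, a small Java sketch of the tightened pattern. The regular expression is the one quoted above; the class name, the helper method, and the use of a whole-string match are assumptions for this example, since the thread does not show how the patch applies the pattern.

```java
import java.util.regex.Pattern;

/** Sketch only: exercises the quoted regex; the surrounding code is hypothetical. */
final class IpPatternSketch {

    // Four groups of one to three digits, separated by dots.
    static final Pattern IP_PATTERN = Pattern.compile("(\\d{1,3}\\.){3}(\\d{1,3})");

    /** True if the whole host string has the dotted-quad shape. */
    static boolean looksLikeIp(String host) {
        return IP_PATTERN.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIp("192.168.0.1"));       // true
        System.out.println(looksLikeIp("1234.5.6.7"));        // false, first group has four digits
        System.out.println(looksLikeIp("lucene.apache.org")); // false
        System.out.println(looksLikeIp("999.999.999.999"));   // true, the 0-255 range is not checked
    }
}
```

The pattern only checks the shape of the string, so out-of-range octets such as 999 still pass.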

> the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that it's a common request for enhancement, but specific requirements vary wildly. Some users prefer to build a separate DB that holds statistical info and can be used in various steps of the work cycle, others still prefer one-time tools such as this one.

DomainStatistics is really a quick hack I wrote to demonstrate the new patch. I've moved it out of the latest patch. Once the user requirements are settled, we can move on from there.

Also you may not want to commit MozillaPublicSuffixListParser.java, but it is good we have it somewhere public. 


> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: domain.suffixes_v2.1.patch, tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513482 ] 

Enis Soztutar commented on NUTCH-439:
-------------------------------------

As for Doğacan's comments I've opened issues NUTCH-518 and NUTCH-517. 

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.1.patch

Oops, it seems that I've uploaded the wrong file. This is the correct one.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch


[jira] Resolved: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-439.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Committed in rev. 568053.

I left MozillaPublicSuffixListParser class out after a discussion with Enis.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 1.0.0
>
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch


[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521512 ] 

Hudson commented on NUTCH-439:
------------------------------

Integrated in Nutch-Nightly #184 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/184/])

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 1.0.0
>
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch



[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ] 

Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. It may be useful, though, to break this patch into multiple pieces: fixes to OPIC and the build system as one patch, core changes as another, and the plugin as a third.

IMHO, most usages of URL.getHost should be replaced with this patch's getDomainName. For example, the "host" field in the index currently gets a big boost, but hosts are easy to spam: just register a domain such as 'example.com', set up your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'. I have actually seen a lot of spam sites that do this. Using the domain name in linkdb would also reduce anchor spam (where 'foo.example.com' links to 'bar.example.com', and Nutch considers this an external link and stores the anchor).
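To make the host-versus-domain distinction concrete, here is a minimal sketch, not the patch's actual code. The class name DomainNameSketch, its methods, and the tiny hard-coded suffix set are all hypothetical; the patch itself reads its suffixes from an XML file. It treats the domain as the longest known suffix plus one label to its left, so all the spam subdomains above collapse to 'example.com'.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DomainNameSketch {
    // Hypothetical, deliberately tiny suffix list, for illustration only.
    private static final Set<String> SUFFIXES =
        new HashSet<>(Arrays.asList("com", "org", "tr", "com.tr", "uk", "co.uk"));

    /**
     * Returns the longest known suffix plus one label to its left,
     * or the lower-cased host unchanged if no suffix matches.
     */
    public static String domainOf(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        // i = 1 yields the longest candidate suffix; later i values are shorter.
        for (int i = 1; i < labels.length; i++) {
            String candidate =
                String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        return h;
    }

    /** True if both hosts fall under the same domain, i.e. a link between them is internal. */
    public static boolean sameDomain(String hostA, String hostB) {
        return domainOf(hostA).equals(domainOf(hostB));
    }
}
```

With this, sameDomain("foo.example.com", "bar.example.com") is true, so a link between the two would count as internal and its anchor would not be stored.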

Another example is the generator. Instead of partitioning on host or IP, we can partition URLs by domain. This avoids the overhead of resolving IPs, which has problems of its own: URLs under the same domain (sometimes even the same URL) may be served from different IPs (think load balancers and the like). It will also be much more polite and more resistant to honey pots.
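A rough sketch of what partitioning by domain could look like, under stated assumptions: DomainPartitionSketch and its methods are hypothetical names, and naiveDomain simply keeps the last two labels of the host, which is wrong for suffixes such as 'co.uk' and stands in for a real suffix-aware lookup. No DNS resolution is involved; the partition depends only on the host string.

```java
public class DomainPartitionSketch {
    /**
     * Naive stand-in for a suffix-aware domain lookup (an assumption of this
     * sketch): keeps only the last two labels of the host.
     */
    public static String naiveDomain(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return h;
        }
        return labels[n - 2] + "." + labels[n - 1];
    }

    /** Maps a key to a partition in [0, numPartitions) via a non-negative hash. */
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Partition for a host, chosen by its domain rather than by host or IP. */
    public static int partitionForHost(String host, int numPartitions) {
        return partitionFor(naiveDomain(host), numPartitions);
    }
}
```

Every host under 'example.com' lands in the same partition, so one fetcher can apply a single politeness delay to the whole domain regardless of how many subdomains or IPs it uses.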

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch
