Posted to dev@nutch.apache.org by "Rodrigo Joni Sestari (JIRA)" <ji...@apache.org> on 2017/05/02 11:54:04 UTC
[jira] [Created] (NUTCH-2381) In some situations the class
TextProfileSignature gives different signatures for the same text "profile"
page.
Rodrigo Joni Sestari created NUTCH-2381:
-------------------------------------------
Summary: In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
Key: NUTCH-2381
URL: https://issues.apache.org/jira/browse/NUTCH-2381
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.13
Reporter: Rodrigo Joni Sestari
In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
The method TextProfileSignature.calculate uses a HashMap to store the tokens. After some processing, the tokens are sorted by decreasing frequency.
For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is the same but the signature comes out different on each fetch.
This happens because the tokens are sorted only by decreasing frequency. Tokens with the same frequency may not end up in the same order across different fetches.
HashMap makes no guarantees as to the order of the map, and in particular it does not guarantee that the order will remain constant over time.
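The effect can be reproduced outside Nutch with a minimal sketch. The Token class, the profile method and the sample words below are hypothetical stand-ins, not the Nutch source, and the two input lists simulate two different iteration orders of the underlying map. Because List.sort is stable, a comparator that looks only at the frequency leaves tied tokens in whatever order they arrived.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FrequencyOnlySortDemo {

    // Hypothetical stand-in for a token and its (quantized) frequency.
    static final class Token {
        final String val;
        final int cnt;

        Token(String val, int cnt) {
            this.val = val;
            this.cnt = cnt;
        }
    }

    // Sorts by decreasing frequency only and renders the result as text.
    // Ties keep their incoming order because List.sort is a stable sort.
    static String profile(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Token t) -> t.cnt).reversed());
        StringBuilder sb = new StringBuilder();
        for (Token t : sorted) {
            sb.append(t.val).append(' ').append(t.cnt).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same tokens and counts, delivered in two different orders.
        List<Token> firstFetch = Arrays.asList(
                new Token("court", 2), new Token("justice", 2), new Token("union", 1));
        List<Token> secondFetch = Arrays.asList(
                new Token("justice", 2), new Token("union", 1), new Token("court", 2));

        System.out.println(profile(firstFetch));
        System.out.println(profile(secondFetch));
        System.out.println("identical: "
                + profile(firstFetch).equals(profile(secondFetch)));
    }
}
```

Any digest computed over the two rendered strings would differ, even though both describe the same set of tokens and counts.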
My suggestion is to change the method TokenComparator.compare so that it sorts by frequency and then by token name.
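A rough sketch of what such a comparator could look like follows. It is not a patch against the Nutch source: the Token class, its field names val and cnt, and the sortedValues helper are assumptions made for illustration. The idea is only that equal frequencies fall back to comparing the token text, so the sorted order no longer depends on the order in which the tokens arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TieBreakSortDemo {

    // Hypothetical stand-in for a token and its (quantized) frequency.
    static final class Token {
        final String val;
        final int cnt;

        Token(String val, int cnt) {
            this.val = val;
            this.cnt = cnt;
        }
    }

    // Decreasing frequency first; tokens with equal frequency are ordered
    // by their text, which makes the result independent of input order.
    static final class TokenComparator implements Comparator<Token> {
        @Override
        public int compare(Token t1, Token t2) {
            int byCount = Integer.compare(t2.cnt, t1.cnt);
            return byCount != 0 ? byCount : t1.val.compareTo(t2.val);
        }
    }

    // Returns the token texts in sorted order (illustrative helper).
    static List<String> sortedValues(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(new TokenComparator());
        List<String> values = new ArrayList<>();
        for (Token t : sorted) {
            values.add(t.val);
        }
        return values;
    }

    public static void main(String[] args) {
        // Same tokens and counts, delivered in two different orders.
        List<Token> firstFetch = Arrays.asList(
                new Token("court", 2), new Token("justice", 2), new Token("union", 1));
        List<Token> secondFetch = Arrays.asList(
                new Token("justice", 2), new Token("union", 1), new Token("court", 2));

        System.out.println(sortedValues(firstFetch));
        System.out.println(sortedValues(secondFetch));
    }
}
```

Integer.compare is used instead of subtracting the counts to avoid any overflow concern, and String.compareTo gives a total order on the token text, so two tokens compare as equal only when both count and text match.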
Rodrigo