Posted to dev@manifoldcf.apache.org by "Shinichiro Abe (JIRA)" <ji...@apache.org> on 2013/01/09 07:52:12 UTC
[jira] [Created] (CONNECTORS-601) make the thresholds of isText() input-able
Shinichiro Abe created CONNECTORS-601:
-----------------------------------------
Summary: make the thresholds of isText() input-able
Key: CONNECTORS-601
URL: https://issues.apache.org/jira/browse/CONNECTORS-601
Project: ManifoldCF
Issue Type: Improvement
Components: Web connector
Affects Versions: ManifoldCF 1.0.1
Reporter: Shinichiro Abe
Priority: Minor
Currently the threshold used by isText() is fixed at 0.30.
This value is too strict for Japanese sites, because their pages often contain few ASCII characters.
As a result, some pages are judged to be non-text, and MCF cannot extract links from those documents.
I'd like to make this value configurable (input-able) in the Repository connection.
I don't have a patch for this yet.
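To illustrate the problem, here is a small standalone sketch (not ManifoldCF code) that computes the ratio isText() compares against 0.30. The class name StrangeByteRatio, the ratio() helper, and the whitespace set in isWhiteSpace() are assumptions made for this example; the real isWhiteSpace() is not shown in this issue.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of ManifoldCF. */
final class StrangeByteRatio {

  /** Assumed whitespace set (tab, LF, CR, space); the real method may differ. */
  static boolean isWhiteSpace(byte x) {
    return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
  }

  /**
   * Flags bytes outside 32..127 that are not whitespace. The byte is widened
   * to an unsigned value first, because Java bytes are signed and every
   * UTF-8 byte in 0x80..0xFF would otherwise be negative.
   */
  static boolean isStrange(byte x) {
    int v = x & 0xFF;
    return (v > 127 || v < 32) && !isWhiteSpace(x);
  }

  /** Fraction of "strange" bytes in the chunk; 0.0 for an empty chunk. */
  static double ratio(byte[] chunk) {
    if (chunk.length == 0) {
      return 0.0;
    }
    int count = 0;
    for (byte x : chunk) {
      if (isStrange(x)) {
        count++;
      }
    }
    return (double) count / chunk.length;
  }

  public static void main(String[] args) {
    byte[] english = "<p>Hello</p>".getBytes(StandardCharsets.UTF_8);
    // "konnichiwa" in hiragana, written with escapes to avoid source-encoding issues.
    byte[] japanese =
        "<p>\u3053\u3093\u306b\u3061\u306f</p>".getBytes(StandardCharsets.UTF_8);
    System.out.println("english:  " + ratio(english));
    System.out.println("japanese: " + ratio(japanese));
  }
}
```

In UTF-8 each of the five hiragana characters takes three bytes, all of them 0x80 or above, so 15 of the 22 bytes in the Japanese sample count as strange. That is well over 0.30 even though the content is ordinary HTML text.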
{code:title=WebcrawlerConnector.java|borderStyle=solid}
/** Test to see if a document is text or not. The first n bytes are passed
 * in, and this code returns "true" if it thinks they represent text. The code
 * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
 * which was based on "perldoc -f -T".
 */
protected static boolean isText(byte[] beginChunk, int chunkLength)
{
  if (chunkLength == 0)
    return true;
  int i = 0;
  int count = 0;
  while (i < chunkLength)
  {
    byte x = beginChunk[i++];
    if (x == 0)
      return false;
    if (isStrange(x))
      count++;
  }
  return ((double)count)/((double)chunkLength) < 0.30;
}

/** Check if character is not typical ASCII. */
protected static boolean isStrange(byte x)
{
  return (x > 127 || x < 32) && (!isWhiteSpace(x));
}
{code}
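One possible shape for the change is sketched below. This is not a patch, only an illustration under stated assumptions: the class ConfigurableTextDetector, the extra threshold argument, and the parseThreshold() helper are hypothetical names, and the whitespace set is a guess because isWhiteSpace() is not quoted above. How the value would actually be stored in and read from the Repository connection is left open.

```java
/** Hypothetical sketch, not ManifoldCF code: isText() with a caller-supplied threshold. */
final class ConfigurableTextDetector {

  /** The value currently hard-coded in isText(). */
  static final double DEFAULT_THRESHOLD = 0.30;

  /** Same logic as the quoted isText(), except that the threshold is a parameter. */
  static boolean isText(byte[] beginChunk, int chunkLength, double threshold) {
    if (chunkLength == 0) {
      return true;
    }
    int count = 0;
    for (int i = 0; i < chunkLength; i++) {
      byte x = beginChunk[i];
      if (x == 0) {
        return false; // a NUL byte is treated as binary content
      }
      if (isStrange(x)) {
        count++;
      }
    }
    return ((double) count) / ((double) chunkLength) < threshold;
  }

  /** Bytes outside 32..127 (read as unsigned) that are not whitespace. */
  static boolean isStrange(byte x) {
    int v = x & 0xFF;
    return (v > 127 || v < 32) && !isWhiteSpace(x);
  }

  /** Assumed whitespace set (tab, LF, CR, space); the real method may differ. */
  static boolean isWhiteSpace(byte x) {
    return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
  }

  /** Parses a user-entered value, falling back to the default when it is missing or invalid. */
  static double parseThreshold(String value) {
    if (value == null || value.trim().isEmpty()) {
      return DEFAULT_THRESHOLD;
    }
    try {
      double parsed = Double.parseDouble(value.trim());
      // Rejects NaN, zero, and negative values.
      return parsed > 0.0 ? parsed : DEFAULT_THRESHOLD;
    } catch (NumberFormatException e) {
      return DEFAULT_THRESHOLD;
    }
  }

  public static void main(String[] args) {
    byte[] page = "<a href=\"x\">\u65e5\u672c\u8a9e\u306e\u30da\u30fc\u30b8</a>"
        .getBytes(java.nio.charset.StandardCharsets.UTF_8);
    System.out.println(isText(page, page.length, DEFAULT_THRESHOLD));
    System.out.println(isText(page, page.length, parseThreshold("0.9")));
  }
}
```

With the default of 0.30 the Japanese sample link is rejected, because 21 of its 37 bytes are non-ASCII, while a looser value such as 0.9 accepts it. Keeping 0.30 as the fallback would leave existing connections unchanged.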