Posted to dev@manifoldcf.apache.org by "Shinichiro Abe (JIRA)" <ji...@apache.org> on 2013/01/09 07:52:12 UTC

[jira] [Created] (CONNECTORS-601) make the thresholds of isText() input-able

Shinichiro Abe created CONNECTORS-601:
-----------------------------------------

             Summary: make the thresholds of isText() input-able
                 Key: CONNECTORS-601
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-601
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
    Affects Versions: ManifoldCF 1.0.1
            Reporter: Shinichiro Abe
            Priority: Minor


Currently the threshold used by isText() is fixed at 0.30.
This value is too strict for Japanese sites, because their pages often contain few ASCII characters.
As a result, some pages are judged to be non-text, and MCF then cannot extract links from those documents.
I'd like to make this value configurable (input-able) on the repository connection.
I have no patch for this yet.
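
To illustrate why the 0.30 cut-off rejects such pages, here is a minimal, self-contained sketch that computes the same "strange byte" ratio for two small HTML fragments. The class name `StrangeRatioDemo`, the sample strings, and the whitespace set are assumptions made for this illustration; they are not taken from the ManifoldCF source.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ManifoldCF.
class StrangeRatioDemo {

  // Mirrors the idea of isStrange(): a byte outside 32..127 that is not whitespace.
  // The whitespace set (tab, LF, CR, space) is an assumption for this sketch.
  static boolean isStrange(byte x) {
    int v = x & 0xFF; // view the signed Java byte as 0..255
    boolean whitespace = (v == 0x09 || v == 0x0a || v == 0x0d || v == 0x20);
    return (v > 127 || v < 32) && !whitespace;
  }

  // Fraction of bytes in the chunk that count as "strange".
  static double strangeRatio(byte[] chunk) {
    if (chunk.length == 0)
      return 0.0;
    int count = 0;
    for (byte b : chunk) {
      if (isStrange(b))
        count++;
    }
    return (double) count / (double) chunk.length;
  }

  public static void main(String[] args) {
    // ASCII-only fragment: no strange bytes at all.
    String english = "<p>Hello world</p>";
    // Seven Japanese characters written as Unicode escapes; each takes 3 bytes in UTF-8.
    String japanese = "<p>\u3053\u3093\u306b\u3061\u306f\u4e16\u754c</p>";

    System.out.println(strangeRatio(english.getBytes(StandardCharsets.UTF_8)));
    System.out.println(strangeRatio(japanese.getBytes(StandardCharsets.UTF_8)));
  }
}
```

In the second fragment, 21 of the 28 UTF-8 bytes are non-ASCII, so the ratio is 0.75, well above 0.30, even though the content is ordinary text.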

{code:title=WebcrawlerConnector.java|borderStyle=solid}
  /** Test to see if a document is text or not.  The first n bytes are passed
  * in, and this code returns "true" if it thinks they represent text.  The code
  * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
  * which was based on "perldoc -f -T".
  */
  protected static boolean isText(byte[] beginChunk, int chunkLength)
  {
    if (chunkLength == 0)
      return true;
    int i = 0;
    int count = 0;
    while (i < chunkLength)
    {
      byte x = beginChunk[i++];
      if (x == 0)
        return false;
      if (isStrange(x))
        count++;
    }
    return ((double)count)/((double)chunkLength) < 0.30;
  }

  /** Check if character is not typical ASCII. */
  protected static boolean isStrange(byte x)
  {
    return (x > 127 || x < 32) && (!isWhiteSpace(x));
  }
{code} 
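
One possible shape for the requested change, offered only as a sketch since no patch exists yet: pass the threshold into isText() and parse it from a user-supplied string, falling back to 0.30 when the value is missing or invalid. The class name `ConfigurableTextSniffer`, the `parseThreshold` helper, the accepted range, and the body of `isWhiteSpace` are all assumptions for illustration; how the value would actually be stored on the repository connection is not decided here.

```java
// Hypothetical sketch, not ManifoldCF code.
class ConfigurableTextSniffer {

  // Current fixed value, kept as the fallback.
  static final double DEFAULT_THRESHOLD = 0.30;

  // Assumed whitespace set: tab, LF, CR, space.
  static boolean isWhiteSpace(byte x) {
    return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
  }

  // Java bytes are signed, so 0x80..0xFF arrive as negative values
  // and are caught by the "x < 32" test.
  static boolean isStrange(byte x) {
    return (x > 127 || x < 32) && !isWhiteSpace(x);
  }

  // Same logic as the existing isText(), with the cut-off passed in.
  static boolean isText(byte[] beginChunk, int chunkLength, double threshold) {
    if (chunkLength == 0)
      return true;
    int count = 0;
    for (int i = 0; i < chunkLength; i++) {
      byte x = beginChunk[i];
      if (x == 0)
        return false; // a NUL byte is treated as binary content
      if (isStrange(x))
        count++;
    }
    return ((double) count) / ((double) chunkLength) < threshold;
  }

  // Parse a user-entered threshold; use the default when it is absent,
  // not a number, or outside the assumed valid range (0, 1].
  static double parseThreshold(String value) {
    if (value == null || value.trim().isEmpty())
      return DEFAULT_THRESHOLD;
    try {
      double t = Double.parseDouble(value.trim());
      return (t > 0.0 && t <= 1.0) ? t : DEFAULT_THRESHOLD;
    } catch (NumberFormatException e) {
      return DEFAULT_THRESHOLD;
    }
  }
}
```

With this shape, a connection configured with a value such as 0.90 would accept a chunk whose non-ASCII ratio is 0.75, while connections that leave the field empty keep today's behavior.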

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira