Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2013/01/09 12:04:16 UTC

[jira] [Assigned] (CONNECTORS-601) make the thresholds of isText() input-able

     [ https://issues.apache.org/jira/browse/CONNECTORS-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-601:
--------------------------------------

    Assignee: Karl Wright
    
> make the thresholds of isText() input-able
> ------------------------------------------
>
>                 Key: CONNECTORS-601
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-601
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.0.1
>            Reporter: Shinichiro Abe
>            Assignee: Karl Wright
>            Priority: Minor
>
> Currently the threshold of isText() is 0.30 by default.
> This value is too strict for Japanese sites, because their pages often contain few ASCII characters.
> As a result, some sites are judged to be non-text, and MCF can't extract links from those documents.
> I'd like to make this value configurable in the repository connection.
> I have no patch for this yet.
> {code:title=WebcrawlerConnector.java|borderStyle=solid}
>   /** Test to see if a document is text or not.  The first n bytes are passed
>   * in, and this code returns "true" if it thinks they represent text.  The code
>   * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
>   * which was based on "perldoc -f -T".
>   */
>   protected static boolean isText(byte[] beginChunk, int chunkLength)
>   {
>     if (chunkLength == 0)
>       return true;
>     int i = 0;
>     int count = 0;
>     while (i < chunkLength)
>     {
>       byte x = beginChunk[i++];
>       if (x == 0)
>         return false;
>       if (isStrange(x))
>         count++;
>     }
>     return ((double)count)/((double)chunkLength) < 0.30;
>   }
>   /** Check if character is not typical ASCII. */
>   protected static boolean isStrange(byte x)
>   {
>     return (x > 127 || x < 32) && (!isWhiteSpace(x));
>   }
> {code} 
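
For illustration only, the request above could be sketched as follows. This is
a minimal sketch, not a patch and not the ManifoldCF implementation: the class
name TextSniffer, the extra "threshold" parameter, and the body of
isWhiteSpace() are assumptions made for the example (the issue does not show
isWhiteSpace()). Because Java's byte type is signed, the sketch masks each
byte to the range 0..255 before comparing it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of isText() with a caller-supplied threshold. */
final class TextSniffer {
  /** Default ratio, taken from the snippet quoted in the issue. */
  static final double DEFAULT_THRESHOLD = 0.30;

  /**
   * Returns true if the first chunkLength bytes look like text, i.e. the
   * fraction of "strange" bytes is below the given threshold.
   * The threshold parameter is the assumed, configurable part.
   */
  static boolean isText(byte[] beginChunk, int chunkLength, double threshold) {
    if (chunkLength == 0)
      return true;
    int count = 0;
    for (int i = 0; i < chunkLength; i++) {
      byte x = beginChunk[i];
      if (x == 0)
        return false; // a NUL byte is treated as a sign of binary data
      if (isStrange(x))
        count++;
    }
    return ((double) count) / ((double) chunkLength) < threshold;
  }

  /** Check if a byte is not typical ASCII (high bit set, or a control code). */
  static boolean isStrange(byte x) {
    int v = x & 0xFF; // byte is signed in Java; view it as 0..255
    return (v > 127 || v < 32) && !isWhiteSpace(v);
  }

  /** Assumed whitespace set; the real helper is not shown in the issue. */
  static boolean isWhiteSpace(int v) {
    return v == '\t' || v == '\n' || v == '\r' || v == '\f' || v == ' ';
  }

  public static void main(String[] args) {
    // Three Japanese characters inside a <p> element: 7 ASCII bytes plus
    // 9 UTF-8 bytes with the high bit set, so the strange ratio is 9/16.
    byte[] page = "<p>\u65e5\u672c\u8a9e</p>".getBytes(StandardCharsets.UTF_8);
    System.out.println(isText(page, page.length, DEFAULT_THRESHOLD)); // false
    System.out.println(isText(page, page.length, 0.60));              // true
  }
}
```

With the default of 0.30 the small Japanese fragment is rejected, while a
looser value such as 0.60 accepts it. How the value would be entered and
stored on the repository connection is left open here.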

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira