Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2013/01/09 12:04:16 UTC
[jira] [Assigned] (CONNECTORS-601) make the thresholds of isText() input-able
[ https://issues.apache.org/jira/browse/CONNECTORS-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright reassigned CONNECTORS-601:
--------------------------------------
Assignee: Karl Wright
> make the thresholds of isText() input-able
> ------------------------------------------
>
> Key: CONNECTORS-601
> URL: https://issues.apache.org/jira/browse/CONNECTORS-601
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Web connector
> Affects Versions: ManifoldCF 1.0.1
> Reporter: Shinichiro Abe
> Assignee: Karl Wright
> Priority: Minor
>
> Currently the threshold in isText() is hard-coded to 0.30.
> This value is too strict for Japanese sites, because their pages often contain few ASCII characters.
> As a result, some pages are judged to be non-text, and MCF then can't extract links from those documents.
> I'd like to make this value configurable in the repository connection.
> I don't have a patch for this yet.
> {code:title=WebcrawlerConnector.java|borderStyle=solid}
> /** Test to see if a document is text or not. The first n bytes are passed
>  * in, and this code returns "true" if it thinks they represent text. The code
>  * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
>  * which was based on "perldoc -f -T".
>  */
> protected static boolean isText(byte[] beginChunk, int chunkLength)
> {
>   if (chunkLength == 0)
>     return true;
>   int i = 0;
>   int count = 0;
>   while (i < chunkLength)
>   {
>     byte x = beginChunk[i++];
>     if (x == 0)
>       return false;
>     if (isStrange(x))
>       count++;
>   }
>   return ((double)count)/((double)chunkLength) < 0.30;
> }
> /** Check if character is not typical ASCII. */
> protected static boolean isStrange(byte x)
> {
>   return (x > 127 || x < 32) && (!isWhiteSpace(x));
> }
> {code}
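To make the request concrete, here is a minimal sketch of how the check might look with the threshold passed in as a parameter. It is not a patch against WebcrawlerConnector.java: the class name TextSniffer, the extra threshold argument, and the body of isWhiteSpace() (which is not shown in the quoted code) are all assumptions made for illustration. The sample page is a short HTML fragment with Japanese text encoded as UTF-8, where every byte of each Japanese character is 0x80 or above and therefore counts as "strange".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone class; the real method lives in WebcrawlerConnector.
class TextSniffer {
    /** Same value as the constant hard-coded in the quoted isText(). */
    static final double DEFAULT_THRESHOLD = 0.30;

    /** Assumed definition: tab, LF, CR and space count as whitespace. */
    static boolean isWhiteSpace(byte x) {
        return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
    }

    /** Java bytes are signed, so mask to 0..255 before comparing. */
    static boolean isStrange(byte x) {
        int v = x & 0xFF;
        return (v > 127 || v < 32) && !isWhiteSpace(x);
    }

    /** Same logic as the quoted isText(), with the threshold as an argument. */
    static boolean isText(byte[] beginChunk, int chunkLength, double threshold) {
        if (chunkLength == 0)
            return true;
        int count = 0;
        for (int i = 0; i < chunkLength; i++) {
            byte x = beginChunk[i];
            if (x == 0)
                return false;
            if (isStrange(x))
                count++;
        }
        return ((double) count) / ((double) chunkLength) < threshold;
    }

    public static void main(String[] args) {
        // "<p>" + 7 Japanese characters + "</p>": 7 ASCII bytes, 21 non-ASCII bytes.
        byte[] page = "<p>\u3053\u3093\u306b\u3061\u306f\u4e16\u754c</p>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(isText(page, page.length, DEFAULT_THRESHOLD)); // false (ratio 0.75)
        System.out.println(isText(page, page.length, 0.90));              // true
    }
}
```

With the default of 0.30 the fragment is rejected because 21 of its 28 bytes are non-ASCII; a looser, user-supplied threshold accepts it, while the zero-byte check still rejects obviously binary content.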
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira