Posted to dev@lucene.apache.org by Derek Wood <dd...@outlook.com> on 2015/06/09 09:14:34 UTC

Refactoring language detection in Solr and a minor bug in maxTotalChars param

I found a bug in the LangDetect implementation of language detection, where the
maxTotalChars property isn't doing what its description says it does: Solr relies
solely on the append() method in the LangDetect library, which checks the length
of each string being appended and not the total length of the accumulated text [1].
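
To illustrate the difference, here is a minimal, self-contained sketch. It does
not use the LangDetect or Solr APIs; the class and method names
(TotalCharCapSketch, perCallCap, totalCap) are hypothetical and exist only for
illustration. It assumes that field values are appended one at a time, which is
how the problem is described above.

```java
import java.util.Arrays;
import java.util.List;

public class TotalCharCapSketch {

    /**
     * Hypothetical per-call cap: each value is truncated on its own,
     * so the combined length can exceed the limit when there are many values.
     */
    public static String perCallCap(List<String> fieldValues, int maxCharsPerCall) {
        StringBuilder sb = new StringBuilder();
        for (String value : fieldValues) {
            sb.append(value, 0, Math.min(value.length(), maxCharsPerCall));
        }
        return sb.toString();
    }

    /**
     * Hypothetical total cap: tracks the remaining budget across all values,
     * so the combined length never exceeds maxTotalChars.
     */
    public static String totalCap(List<String> fieldValues, int maxTotalChars) {
        StringBuilder sb = new StringBuilder();
        for (String value : fieldValues) {
            int remaining = maxTotalChars - sb.length();
            if (remaining <= 0) {
                break; // budget exhausted, ignore the remaining values
            }
            sb.append(value, 0, Math.min(value.length(), remaining));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("aaaaaaaa", "bbbbbbbb", "cccccccc");
        // Each value is shorter than the limit, so a per-call check lets all of them through.
        System.out.println(perCallCap(values, 10).length()); // 24
        // A total check stops once 10 characters have been accumulated.
        System.out.println(totalCap(values, 10).length());   // 10
    }
}
```

With three 8-character values and a limit of 10, the per-call check never
triggers, while the total check truncates the second value and skips the third.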

I've got a patch (attached) that solves this issue and hoists out a few of the
utility methods in the Tika implementation and reuses them in the LangDetect
one, but I stumbled upon SOLR-3881 [2], where the methods (concatFields and
getExpectedSize specifically) were taken out of the parent class for reasons
that are unclear from the comments.

Could I get some historical context on the issue and feedback on my patch?
Thanks

[1] https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L170
[2] https://issues.apache.org/jira/browse/SOLR-3881

Re: Refactoring language detection in Solr and a minor bug in maxTotalChars param

Posted by Upayavira <uv...@odoko.co.uk>.
Derek,

Make your own JIRA, and link it to the one you mention below. Then this
issue can potentially be tracked through to a commit if it goes that
far.

Thx!

Upayavira

> Email had 1 attachment:
> + langdetect-fix.patch
>   8k (text/plain)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org