Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2012/02/23 10:59:40 UTC

PatternReplaceCharFilter, LUCENE-3820

Heads up.

https://issues.apache.org/jira/browse/LUCENE-3820

I've suggested a patch to PatternReplaceCharFilter that heavily
simplifies the code, passes tests that previously didn't pass and,
well, perhaps contains the processing logic that is a bit easier to
understand (for me :).

This patch does drop support for block delimiters and block processing
though (!) so the entire input from CharStreams underneath is buffered
and preprocessed in one go. Is anybody using these block delimiters?
It should be possible to add them back in, but if it's a dead feature
then we may as well just drop it entirely.
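
For readers unfamiliar with the trade-off, here is a minimal sketch of
what "buffer everything, then process in one go" amounts to. It is not
the code from the LUCENE-3820 patch: the class and method names
(BufferedPatternReplace, readFully, replaceAll) are made up for
illustration, it works on a plain java.io.Reader rather than a
CharStream, and it leaves out the offset correction a real char filter
has to perform.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual PatternReplaceCharFilter code.
class BufferedPatternReplace {

    // Drain the reader completely. Memory use grows with the input length,
    // which is the OOM concern for very long documents.
    static String readFully(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Apply the pattern once over the whole buffered input, so a match can
    // never be split across a block boundary.
    static String replaceAll(Reader in, Pattern pattern, String replacement)
            throws IOException {
        return pattern.matcher(readFully(in)).replaceAll(replacement);
    }

    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("aa bb  cc");
        // Collapse every run of whitespace into a single underscore.
        System.out.println(replaceAll(input, Pattern.compile("\\s+"), "_"));
    }
}
```

With no block delimiters there is only one code path to reason about,
at the cost of holding the full input in memory.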

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: PatternReplaceCharFilter, LUCENE-3820

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Koji,

> As I thought that buffering the entire String tends to cause OOM when doing
> pattern matching,

This is a possibility, of course, if you have super-long documents on
input (buffering the input strings will be the problem; pattern matching
itself shouldn't be). Was it a precautionary measure or did you really
hit OOMs? It would be possible to reintroduce those block delimiters, of
course, but I still think it doesn't make much sense from a
practical point of view -- the code is simpler, more effective and
doesn't crash without them; if somebody parses super-long inputs then
I'm sure this won't be the only source of problems.

I will commit this. Now that this is covered by Robert's randomized
pattern tests (which I'm sure will make all regexp implementations
very excited indeed) we can come back to it once there's realistic
feedback that the lack of block boundaries causes a problem.

Dawid



Re: PatternReplaceCharFilter, LUCENE-3820

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(12/02/23 18:59), Dawid Weiss wrote:
> Heads up.
>
> https://issues.apache.org/jira/browse/LUCENE-3820
>
> I've suggested a patch to PatternReplaceCharFilter that heavily
> simplifies the code, passes tests that previously didn't pass and,
> well, perhaps contains the processing logic that is a bit easier to
> understand (for me :).
>
> This patch does drop support for block delimiters and block processing
> though (!) so the entire input from CharStreams underneath is buffered
> and preprocessed in one go. Is anybody using these block delimiters?
> It should be possible to add them back in, but if it's a dead feature
> then we may as well just drop it entirely.

Hi Dawid,

As I thought that buffering the entire String tends to cause OOM when doing
pattern matching, I introduced the idea of blocks. If it doesn't,
I think we can just drop it.

koji
-- 
Query Log Visualizer for Apache Solr
http://soleami.com/
