You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Dawid Weiss (Created) (JIRA)" <ji...@apache.org> on 2012/02/28 10:43:48 UTC

[jira] [Created] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

MappingCharFilter could be improved by switching to an FST.
-----------------------------------------------------------

                 Key: LUCENE-3830
                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Dawid Weiss
            Assignee: Dawid Weiss
            Priority: Minor
             Fix For: 4.0


MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3830:
---------------------------------------

    Attachment: PerfTestMappingCharFilter.java
                LUCENE-3830.patch

New patch: I made a simple perf test (attached:
PerfTestMappingCharFilter.java), and found the FST was slower... I
fixed RollingCharBuffer to bulk read, and also pre-cache the FST arcs
in a HashMap (FST pre-caches only latin bytes), and now perf is a bit
faster.  I also switched to builder API (like SynFilter) so
the NormalizeCharMap instance is immutable.

I think it's ready!

We can separately improve the matching algo based on the 23 page
paper...

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267908#comment-13267908 ] 

Michael McCandless commented on LUCENE-3830:
--------------------------------------------

bq.  from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right?

Right.  I mean, that's how the code currently works ... I'm just
replacing the recursive Map w/ an FST.

bq. If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences). 

Yes... though I suspect in practice the impact is minimal for Lucene's
typical use cases.  Still it would be nice to fix :)

bq. The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol.

OK!  I think I understand... I think this amounts to putting a .*
cycle on the FST's start node, and then doing subset construction to
"determinize" as you traverse?  Ie track all states that the input
characters could be in... so that you only traverse the input once.

bq. I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it.

Yeah I think I think the paper (and Aho/Corasick) essentially do that
determinization "up front".

SynonymFilter also has the same limitation....

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3830.
----------------------------------------

    Resolution: Fixed
    
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3830:
---------------------------------------

    Attachment: LUCENE-3830.patch

Patch; I think it's basically working.

I disallow empty string match to be added (this was accepted, but I think ignored, before), so I had to fix a couple places that were doing that.

I also added another random test that separately (slowly) computes the output & corrected offsets and then compares ... it's passing.

I still need to do a perf test.. does anyone have a "typical" set of mappings + content I could use?  Else I'll try to come up with something...
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269210#comment-13269210 ] 

Robert Muir commented on LUCENE-3830:
-------------------------------------

patch looks good: I guess the bulk read in RollingCharBuffer should help
other things like Kuromoji that use it too?!
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3830:
---------------------------------------

    Attachment: LUCENE-3830.patch

Patch moving the root arc cache creation to NormalizeCharMap...
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265137#comment-13265137 ] 

Robert Muir commented on LUCENE-3830:
-------------------------------------

just laughing a little bit to stumble on this:

http://www.cis.uni-muenchen.de/people/Schulz/Pub/dictle5.ps
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266442#comment-13266442 ] 

Dawid Weiss commented on LUCENE-3830:
-------------------------------------

bq. Hmm I don't quite understand ... can you describe more?

I'm on vacation this week, so short: from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right? If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences). The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol. If there a given state fires as a match then you have some conditions to check -- are there any states that may be potentially longer but haven't fired yet (if so, you need to delay this firing state because it can be subsumed by others), otherwise when the no other active state blocks that one you can do the replacement.

I haven't really thought about how twisted the logic above can become but I suspect it's not going to be really that bad. The gain is that each input symbol advances your state (at most linearly with the number of active states). It helps with degenerate cases like the one above. I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it.

It's an improvement, doesn't need to be implemented right away. Sorry for being brief -- the network is flaky here, I'm in the middle of nowhere.

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268444#comment-13268444 ] 

Robert Muir commented on LUCENE-3830:
-------------------------------------

Why does mappingcharfilter itself cache FST arcs?

This seems like a slow thing to do per-reader. Shouldnt this be done
when you build the NormalizeCharMap instead?
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265888#comment-13265888 ] 

Robert Muir commented on LUCENE-3830:
-------------------------------------

I only glanced at the Schulz/Mihov paper, but it seemed like in order for it to work you have
to add extra transitions (loops) which would make the FST infinite? (could be totally wrong here)

Maybe there is some creative way we could do this, but thats just in addition to the fact the
paper is ... like their other papers and would take some work to decode anyway :)
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265964#comment-13265964 ] 

Michael McCandless commented on LUCENE-3830:
--------------------------------------------

bq. Just laughing a little bit to stumble on this:

Only 23 pages!

bq. Javadoc is probably wrong (sequence of bytes?).

Thanks Dawid, I'll fix.

bq. I'd probably implement it based on the stack of currently "active" tokens moving through the fst – this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position.

Hmm I don't quite understand ... can you describe more?

If there's something simple in between what MappingCharFilter (and my patch) does today ("restart on each position"), and the 23 page paper, that would be nice :)

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269211#comment-13269211 ] 

Michael McCandless commented on LUCENE-3830:
--------------------------------------------

bq.  I guess the bulk read in RollingCharBuffer should help other things like Kuromoji that use it too?!

I haven't tested but I think it should help!
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265154#comment-13265154 ] 

Dawid Weiss commented on LUCENE-3830:
-------------------------------------

I briefly looked at the patch. Looks good (didn't check indexes and most of the logic, just went through the general idea)
{code}
+ * An FST {@link Outputs} implementation where each output
+ * is a sequence of bytes.
+ *
+ * @lucene.experimental
+ */
+
+public final class CharSequenceOutputs extends Outputs<CharsRef> { 
{code}
Javadoc is probably wrong (sequence of bytes?).

Other than that I think it's all right. I'd probably implement it based on the stack of currently "active" tokens moving through the fst -- this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position.

But then -- Mihov's algorithm claims to reduce this to O(1) basically so it'd be very nice to implement that instead ;)

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3830:
---------------------------------------

    Labels: gsoc2012 lucene-gsoc-12  (was: )
    
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268469#comment-13268469 ] 

Michael McCandless commented on LUCENE-3830:
--------------------------------------------

bq. Shouldnt this be done when you build the NormalizeCharMap instead?

Duh, yes!  I'll fix.
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch, LUCENE-3830.patch, PerfTestMappingCharFilter.java
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265145#comment-13265145 ] 

Dawid Weiss commented on LUCENE-3830:
-------------------------------------

Stoyan Mihov is everywhere, he's done everything :)
                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Assigned] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3830:
------------------------------------------

    Assignee: Michael McCandless  (was: Dawid Weiss)
    
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org