You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/06/23 16:01:47 UTC

[jira] [Created] (LUCENE-3233) HuperDuperSynonymsFilter™

HuperDuperSynonymsFilter™
-------------------------

                 Key: LUCENE-3233
                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Robert Muir


The current synonymsfilter uses a lot of ram and cpu, especially at build time.

I think yesterday I heard about "huge synonyms files" three times.

So, I think we should use an FST-based structure, sharing the inputs and outputs.
And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

Another rev of the patch: I did a hard bump the FST version (so
existing trunk indices must be rebuilt), and added NOTE in suggest's
FST impl that the file format is experimental; removed
maxVerticalContext; fixed false test failure.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088666#comment-13088666 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

It does support multi-word synonyms -- can you give some details on the odd behavior?  Maybe email dev@lucene.apache.org, or open a new Jira issue?

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061001#comment-13061001 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

Benchmark with the synonyms.zip attached to this issue (so that we are actually matching synonyms):
in this case i only analyzed the text 100,000 times, as its a lot more output.
I also checked they are emitting the same stuff:

||Impl||ms||
|SynonymsFilter|112527|
|FST|22872|

So, thats 5x faster, I think probably due to avoiding the expensive cloning.

I think we are fine on performance.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3233.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.4

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

here's a rough start to building a datastructure that I think makes good tradeoffs between RAM and processing.

No matter what, the processing on the filter-side will be hairy because of the 'interleaving' with the tokenstream.

This one is just an FST<CharsRef,Int[]>(BYTE4) where Int is an ord to a BytesRefHash, containing the output Bytes for each term.

This way, at input time we can walk the FST with codePointAt()

On both sides, the Chars/Bytes are actually phrases, using \u0000 as a word separator.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Glen Stampoultzis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088540#comment-13088540 ] 

Glen Stampoultzis commented on LUCENE-3233:
-------------------------------------------

Does this support multi word synonyms? I tried one of the 3.4 builds and I was getting odd results on multi word synonyms.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060698#comment-13060698 ] 

David Smiley commented on LUCENE-3233:
--------------------------------------

Wow that's striking; nice work guys.  FSTs are definitely one of those killer pieces of technology in Lucene.

The difference in build time is surprising to me.  Any theory why SynonymFilterFactory takes so much more time to build?

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

fixed some bugs, added some tests, but there is a problem, I started to add a little benchmark and I hit this on my largish synonyms file:
{noformat}
java.lang.IllegalStateException: max arc size is too large (445)
{noformat}

Just run the TestFSTSynonymFilterFactory and you will see it, i enabled some prints and it doesn't appear like anything totally stupid is going on... giving up for the night :)

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091919#comment-13091919 ] 

Yonik Seeley commented on LUCENE-3233:
--------------------------------------

I just tested by hand:
I added a line to synonyms.txt "a\,a => b\,b", fired up the example server and then executed the following query:
http://localhost:8983/solr/select?q=a,a&debugQuery=true

I then verified that the synonyms were in effect in general, via:
http://localhost:8983/solr/select?q=fooaaa&debugQuery=true

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved LUCENE-3233.
----------------------------------

    Resolution: Fixed

User error - the field type changes in the example schema tripped up the user (and me too).  The standard tokenizer does not keep "a,a" as a single token.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch, moving the root arcs cache into FST, not using up our last precious arc bit.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060613#comment-13060613 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

Thanks Mike, I will set the option for now, we can address any potential perf hit in a number of different ways here (besides modifying FST itself).


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3223.patch

Dumping my current state on FSTSynonymFilter -- it compiles but it's got tons of bugs I'm sure!  I added a trivial initial test.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062384#comment-13062384 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

This one is faster for the simple reason that the TokenFilter uses captureState() and restoreState() [and has logic to minimize cloning in more cases] instead of AttributeSource.cloneAttributes()/copyTo()

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061232#comment-13061232 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

so i don't forget, lets not waste an arc bitflag marking an arc as 'first'... 

I hear the secret is instead arc.target == startNode

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060781#comment-13060781 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

I agree, this would be the best solution. Maybe we should just open a separate issue for that? 

we can let this one be for now until that is resolved, can even continue working on other parts of it.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

here is a patch with a little microbenchmark... so we have some tuning to do. 

the benchmark analyzes a short string a million times, that doesn't match any synonyms (actually hte solr default)

||impl||ms||
|SynonymsFilter|1692|
|FST with array arcs|2794|
|FST with no array arcs|8823|

so, disabling the array arcs is a pretty crucial hit here. but we could do other options to speed up this common case, e.g. with daciuk we could build a charrunautomaton of the K-prefixes of the synonyms, this would be really fast to reject these terms that don't match any syns.

or we could explicitly put our bytesref output in a byte[], and use long pointers as outputs.

or we could speed up FST! But i think its interesting to see how important this parameter is.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060705#comment-13060705 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

{quote}
The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes so much more time to build?
{quote}

Yes, its the n^2 portion where you have a synonym entry like this: a, b, c, d
in reality this is creating entries like this:
a -> a
a -> b
a -> c
a -> d
b -> a
b -> b
...

in the current impl, this is done using some inefficient datastructures (like nested chararraymaps with Token),
as well as calling merge().

In the FST impl, we don't use any nested structures (instead input and output entries are just phrases), and we explicitly 
deduplicate both inputs and outputs during construction, the FST output is just a
List<Integer> basically pointing to ords in the deduplicated bytesrefhash.

so during construction when you add() its just a hashmap lookup on the input phrase, a bytesrefhash get/put on the UTF16toUTF8WithHash
to get the output ord, and an append to an arraylist.

this code isn't really optimized right now and we can definitely speed it up even more in the future. but the main thing
right now is to ensure the filter performance is good.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060471#comment-13060471 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

bq. java.lang.IllegalStateException: max arc size is too large (445)

Ahh -- to fix this we have to call Builder.setAllowArrayArcs(false), ie, disable the array arcs in the FST (and this binary search lookup for finding arcs!).  I had to do this also for MemoryCodec, since postings encoded as output per arc can be more than 256 bytes, in general.

This will hurt perf, ie, the arc lookup cannot use a binary search; it's because of a silly limitation in the FST representation, that we use a single byte to hold the max size of all arcs, so that if any arc is > 256 bytes we are unable to encode it as an array.  We could fix this (eg, use vInt), however, arcs with such widely varying sizes (due to widely varying outputs on each arc) will be very wasteful in space because all arcs will use up a fixed number of bytes when represented as an array.

For now I think we should just call the above method, and then test the resulting perf.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062382#comment-13062382 ] 

Yonik Seeley commented on LUCENE-3233:
--------------------------------------

bq. but I'd like to rename existing syn filter to SlowSynonymFilter and rename the new one to SynonymFilter.

But the lookup on the original is still faster, right?  And if someone has small synonym dicts (actually pretty common in my experience since SynonymFilter isn't necessary used to inject synonyms in the traditional sense, but for any mapping task) then build time and mem use won't be much of an issue (esp if the input to match is mostly single words).

This looks great for large synonym maps, but perhaps instead of Slow* or Fast* we could name them for the implementation and either name the new one FSTSynonymFilter or rename the current one to MapSynonymFilter?  Or is the plan to actually deprecate the current SynonymFilter? 


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060905#comment-13060905 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

bq. New patch, including some optimizing to FST (which we can commit under a separate issue)

works! I don't think we need to open a new issue, I didn't think you would come back with a patch in just two hours!

I'll play with the patch some now and see what I can do with it.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062380#comment-13062380 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

I'll try to add some not-so-sophisticated backwards here!

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: synonyms.zip

attaching my synonyms.txt test file that i was using: its derived from wordnet

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091843#comment-13091843 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

Do you have a test?

Because we have tests for this:
{noformat}
    String testFile = 
      "a\\=>a => b\\=>b\n" +
      "a\\,a => b\\,b";
...
    assertAnalyzesTo(analyzer, "a=>a",
        new String[] { "b=>b" },
        new int[] { 1 });
    
    assertAnalyzesTo(analyzer, "a,a",
        new String[] { "b,b" },
        new int[] { 1 });
{noformat}

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

updated patch, this tableizes the first FST arcs for latin-1.

precomputing this tiny table speeds up this filter a ton (~3000ms -> ~2000ms) and I think is a cheap easy win for the terms index too.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060686#comment-13060686 ] 

Robert Muir commented on LUCENE-3233:
-------------------------------------

I ran some quick numbers, using the syn file example here, just best of 3 runs:

||Impl||Build time||RAM usage||
|SynonymFilterFactory|6619ms|207.92 mb|
|FSTSynonymFilterFactory|463 ms|3.51 mb|

I modified the builder slightly to build the FST more efficiently for this, will upload the updated patch.

So i think the build-time and RAM consumption are really improved, the next thing is to benchmark the runtime perf.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060578#comment-13060578 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

Actually, maybe a better general fix for FST would be for it to dynamically decide whether to make an array based on how many bytes will be wasted (in addition to the number of arcs / depth of the node).  This way we could turn on arcs always, and FST would pick the right times to use it.  If we stick to only 1 byte for the number of bytes per arc, the FST could simply not use the array when an arc is > 256 bytes.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch, also setting offsets in the produced tokens (the wordnet test passes), and adding adding a NOTE about the dup output words issue.

I think it's finally ready to commit!

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062375#comment-13062375 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

I think this is ready to commit, but I'd like to rename existing syn filter to SlowSynonymFilter and rename the new one to SynonymFilter.

Because there are some minor diffs (deduping rules, lowercasing), for Solr to cutover I think we need some back compat logic; I'll open a separate issue for this.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Assigned] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-3233:
-----------------------------------

    Assignee: Robert Muir

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Glen Stampoultzis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089319#comment-13089319 ] 

Glen Stampoultzis commented on LUCENE-3233:
-------------------------------------------

I'll try and put together a simple test case just to make sure I've got something repeatable and post to the list. I think it might have been related to synonym fields tokenized using KeywordTokenizerFactory.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch w/ current state.

I think it's closer; the test has more cases now (but I'd still like to make a random test), fewer nocommits, etc.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

patch with a first random test, this one currently does 10 iterations where it adds random shit to the synonym map, then it analyzes 10k random strings (each time capturing the output, and replaying it back to ensure the thing is deterministic and doesn't have reuse bugs).

i also added the ignoreCase support.

the filter might have a reuse bug, see ant test -Dtestcase=TestFSTSynonymMapFilter -Dtestmethod=testRandom -Dtests.seed=-4122723628721952592:244824441557739968


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

just updating patch to trunk, the nocommits remain...

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch, adding dedup option to the builder, removing a couple nocommits, cutting back on iters/counts in testRandom2.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

Updated patch: 
* renamed to SynonymFilter
* added not-so-sophisticated backwards layer
* added tests
* added parser for format="wordnet"
* removed contrib/wordnet

but i found some bugs (well one, surely is) in the new tests, so i added nocommits here.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062386#comment-13062386 ] 

Yonik Seeley commented on LUCENE-3233:
--------------------------------------

bq. Now, from the testing above, it looks like we are faster when syns actually match; if no syns match the two are around the same speed.

Oh cool!  I was looking at "1692" for the SynonymsFilter and a drop from "~3000ms -> ~2000ms" for the FST version.   I assumed Robert's last benchmark was building and not lookup (the 112527/22872).

bq. Separately: shouldn't we not have any syns in the default text_en field type?

I dunno... it's nice for both demonstration and testing (and it's in the current tutorial).


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Reopened] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley reopened LUCENE-3233:
----------------------------------

      Assignee:     (was: Robert Muir)

Looks like that when Solr's synonym parsing was moved to the analysis module, it was also rewritten, introducing escaping bugs.

Examples:
a\,a is no longer treated as a single token
a\=>a is no longer treated as a single token
a\ta is treated as "ata" instead of containing a tab character

I didn't do a full review, so I'm not sure if there are other differences in behavior.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

Updated patch:
* added a SolrSynonymsParser and test to the analyzers module, that parses the existing solr synonyms format.
* added a Solr factory for this thing (untested!) that uses this when "format=solr" (the default)

This way, the idea is the factory would be more extensible, e.g. you could load syns from a database, or we could add parsers for wordnet and nuke contrib/wordnet, etc etc.

Still need to do some basic benchmarking.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060764#comment-13060764 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

Wow, it's very important to allow arcs to be encoded as arrays (for the binary search on lookup).  I think we should just fix FST... I'll think about it.  MemoryCodec would also get big gains here.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062383#comment-13062383 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

bq. But the lookup on the original is still faster, right?

That was before we optimized FST for this usage case.

Now, from the testing above, it looks like we are faster when syns actually match; if no syns match the two are around the same speed.

Separately: shouldn't we not have any syns in the default text_en field type?  Like we can have a synonyms.txt but comment out all the rules in there?

I don't think we should keep the old one around, ie, we should [eventually] replace it with the new one.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch, folding in Robert's changes and the random stress test.  All tests pass.  I think it's now functionally correct, but I still need to compare perf vs existing syn filter, and there are still a few minor nocommits to work out.

Ideally, if we get perf close enough, since RAM is much much less w/ this new syn filter, I think we should replace the old one with this new one.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3233:
---------------------------------------

    Attachment: LUCENE-3233.patch

New patch, including some optimizing to FST (which we can commit under a separate issue): array arcs can now be any size, and I re-use the BytesReader inner class that's created for parsing arcs.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org