Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2012/06/27 16:22:44 UTC

[jira] [Created] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Robert Muir created LUCENE-4170:
-----------------------------------

             Summary: TestRandomChains fail with Shingle+CommonGrams
                 Key: LUCENE-4170
                 URL: https://issues.apache.org/jira/browse/LUCENE-4170
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Robert Muir


ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=12635ABB4F789F2A -Dtests.multiplier=3 -Dtests.locale=pt -Dtests.timezone=America/Argentina/Salta -Dargs="-Dfile.encoding=ISO8859-1"

This test chains two ShingleFilters followed by a CommonGramsFilter. I think the posLen (position length) implementations in CommonGrams and/or Shingle have a bug when the input is already a graph.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-4170:
--------------------------------

    Attachment: recursive.shinglefilter.output.png

This image is a (not pretty) word lattice representation of the output from the double ShingleFilter thought problem described in my comment on this issue. It should make the graph easier to visualize.

(I wish I could make Graphviz line up the dots in a straight line, but couldn't figure out how to do that.)
                



[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402256#comment-13402256 ] 

Robert Muir commented on LUCENE-4170:
-------------------------------------

I think shingles has a similar bug: it doesn't look at the existing posLength of the input tokens at all; instead, it just fills posLength with the builtGramSize.
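
To make the difference concrete, here is a minimal sketch in plain Java (it does not use the Lucene API). The Tok class and both method names are hypothetical, and summing the input position lengths is only one possible fix, not necessarily what the eventual patch does.

```java
import java.util.Arrays;
import java.util.List;

public class ShinglePosLength {
    /** Hypothetical stand-in for a token: its text and how many positions it spans. */
    static final class Tok {
        final String term;
        final int posLength;
        Tok(String term, int posLength) {
            this.term = term;
            this.posLength = posLength;
        }
    }

    /** Behaviour described in the comment: posLength is just the number of tokens in the gram. */
    static int gramSizePosLength(List<Tok> gram) {
        return gram.size();
    }

    /** Assumed alternative: add up the posLength of every input token in the gram. */
    static int summedPosLength(List<Tok> gram) {
        int total = 0;
        for (Tok t : gram) {
            total += t.posLength;
        }
        return total;
    }

    public static void main(String[] args) {
        // The first token already spans two positions, e.g. output of an earlier shingle pass.
        List<Tok> gram = Arrays.asList(new Tok("a_b", 2), new Tok("c", 1));
        System.out.println(gramSizePosLength(gram)); // prints 2
        System.out.println(summedPosLength(gram));   // prints 3
    }
}
```

With a two-position token in the gram, the two approaches disagree (2 versus 3), which is the kind of mismatch that only shows up when the input is already a graph.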
                



[jira] [Updated] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4170:
--------------------------------

    Attachment: LUCENE-4170.patch

First stab at a patch for CommonGrams' posLen. The test still fails, though, so either my patch is wrong or we need to fix Shingle too.

We could use some standalone tests here as well.
                



[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402314#comment-13402314 ] 

Steven Rowe commented on LUCENE-4170:
-------------------------------------

bq. I'm not even sure what token ngramming should mean over an input graph.

A thought problem: run ShingleFilter with mingramsize=2, maxgramsize=3, outputUnigrams=true over input {{\[a/1] \[b/1] \[c/1] \[d/1]}} (where {{/n}} indicates poslength = {{n}}, and {{\[a b]}} indicates tokens {{a}} and {{b}} are at the same position; I'll omit the {{\[]}}'s below when only one token is at a given position), then run ShingleFilter again with the same config over the first ShingleFilter's output:

{noformat}
shinglefilter(min:2,max:3,unigrams:true) with input:  a/1  b/1  c/1  d/1 

"_" token sep: [a/1  a_b/2  a_b_c/3]  [b/1  b_c/2  b_c_d/3]  [c/1  c_d/2]  d/1

shinglefilter(2,3,unigrams) with shinglefilter output above as input:

"=" token sep: [a/1  a_b/2  a_b_c/3  a=b/2  a=b_c/3  a=b_c_d/4  a=b=c/3  a=b=c_d/4  a=b_c=d/4  a_b=c/3  a_b=c_d/4  a_b=c=d/4  a_b_c=d/4]  
               [b/1  b_c/2  b_c_d/3  b=c/2  b=c_d/3  b_c=d/3]
               [c/1  c_d/2  c=d/2]
               d/1
{noformat}
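
The first pass above can be reproduced with a small plain-Java simulation. This is a sketch only: it does not call the real ShingleFilter, the class and method names are made up for illustration, and it assumes linear input where every token has poslength 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstPassShingles {
    /**
     * For each start position, emits the unigram plus every shingle of
     * minSize..maxSize tokens, each labelled term/poslength.
     */
    static List<List<String>> shingle(String[] terms, int minSize, int maxSize, String sep) {
        List<List<String>> positions = new ArrayList<>();
        for (int start = 0; start < terms.length; start++) {
            List<String> here = new ArrayList<>();
            here.add(terms[start] + "/1"); // outputUnigrams=true
            for (int size = minSize; size <= maxSize && start + size <= terms.length; size++) {
                String gram = String.join(sep, Arrays.copyOfRange(terms, start, start + size));
                // Every input token spans one position, so the shingle spans `size` positions.
                here.add(gram + "/" + size);
            }
            positions.add(here);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(shingle(new String[] {"a", "b", "c", "d"}, 2, 3, "_"));
        // prints [[a/1, a_b/2, a_b_c/3], [b/1, b_c/2, b_c_d/3], [c/1, c_d/2], [d/1]]
    }
}
```

The second pass cannot be simulated this simply, because its input has several tokens per position with differing poslengths, which is exactly the open question.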

                



[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402307#comment-13402307 ] 

Steven Rowe commented on LUCENE-4170:
-------------------------------------

bq. I think shingles has a similar bug: it doesn't look at the existing posLength of the input tokens at all; instead, it just fills posLength with the builtGramSize.

I agree.

However, the problem isn't just position length: ShingleFilter has never handled input position increments of zero, so real graph compatibility will mean fixing that too.

I think Karl Wettin's ShingleMatrixFilter (deprecated in 3.6, dropped in 4.0) is an attempt to permute all combinations of overlapping (poslength=1) terms to produce shingles.  ShingleMatrixFilter wouldn't handle poslength > 1, though.

I'm not even sure what token ngramming should mean over an input graph.  The trivial case where input tokens' poslength is always one and position increment is always one is obviously already handled.

I think both issues should be handled, since poslength > 1 will very likely be used with posincr = 0, e.g. synonyms and kuromoji de-compounding.
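
As an illustration of how the two attributes combine, the sketch below turns a token sequence into graph arcs: a token starts at the running sum of position increments and ends poslength positions later. It is plain Java, not Lucene code, and the Tok class and the sample tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphArcs {
    /** Hypothetical token: term, position increment, position length. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Renders each token as term:startNode->endNode. */
    static List<String> arcs(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // a first token with posInc=1 then starts at node 0
        for (Tok t : tokens) {
            pos += t.posInc;                 // posInc=0 keeps the same start node
            out.add(t.term + ":" + pos + "->" + (pos + t.posLen));
        }
        return out;
    }

    public static void main(String[] args) {
        // A compound that spans two positions, stacked (posInc=0) on its first part.
        List<Tok> tokens = Arrays.asList(
            new Tok("wifi", 1, 2),
            new Tok("wi", 0, 1),
            new Tok("fi", 1, 1),
            new Tok("router", 1, 1));
        System.out.println(arcs(tokens));
        // prints [wifi:0->2, wi:0->1, fi:1->2, router:2->3]
    }
}
```

A shingler that is graph-aware would have to follow these arcs (for example pairing "wifi" with "router", but not with "fi") rather than simply concatenating tokens in stream order.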

                
