Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2009/06/02 18:40:07 UTC

[jira] Created: (LUCENE-1676) New Token filter for adding payloads "in-stream"

New Token filter for adding payloads "in-stream"
------------------------------------------------

                 Key: LUCENE-1676
                 URL: https://issues.apache.org/jira/browse/LUCENE-1676
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/analyzers
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor
             Fix For: 2.9


This TokenFilter splits a token on a delimiter and uses one part as the token and the other part as a payload.  This allows someone to include payloads inline with tokens (presumably set up by a pipeline ahead of time).  An example is apropos.  Given a | delimiter, we could have a stream that looks like:
{quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN{quote}

In this case, this would produce tokens and payloads (assuming whitespace tokenization):
Token: The
Payload: null

Token: quick
Payload: JJ

Token: red
Payload: JJ

and so on.

This patch will also support pluggable encoders for the payloads, so it can convert from the character array to byte arrays as appropriate.
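
To make the idea concrete, here is a minimal standalone sketch of the splitting and encoding described above. It is not the code from LUCENE-1676.patch and does not use the Lucene API: the names InlinePayloadSketch, Encoder, UTF8, TermAndPayload, split and analyze are all hypothetical, and splitting on the first occurrence of the delimiter is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this only illustrates the idea in the issue text.
class InlinePayloadSketch {

    /** Pluggable conversion from the payload characters to bytes. */
    interface Encoder {
        byte[] encode(char[] buffer, int offset, int length);
    }

    /** One possible encoder: keep the payload text as UTF-8 bytes. */
    static final Encoder UTF8 = (buf, off, len) ->
            new String(buf, off, len).getBytes(StandardCharsets.UTF_8);

    /** A term plus its payload; the payload is null when no delimiter was present. */
    static final class TermAndPayload {
        final String term;
        final byte[] payload;

        TermAndPayload(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Splits one token on the first delimiter (an assumption, not stated in the issue). */
    static TermAndPayload split(String token, char delimiter, Encoder encoder) {
        int idx = token.indexOf(delimiter);
        if (idx < 0) {
            return new TermAndPayload(token, null);
        }
        char[] chars = token.toCharArray();
        byte[] payload = encoder.encode(chars, idx + 1, chars.length - idx - 1);
        return new TermAndPayload(token.substring(0, idx), payload);
    }

    /** Whitespace tokenization followed by the delimiter split. */
    static List<TermAndPayload> analyze(String text, char delimiter, Encoder encoder) {
        List<TermAndPayload> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                out.add(split(tok, delimiter, encoder));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN";
        for (TermAndPayload tp : analyze(text, '|', UTF8)) {
            String payload = tp.payload == null
                    ? "null"
                    : new String(tp.payload, StandardCharsets.UTF_8);
            System.out.println("Token: " + tp.term + "  Payload: " + payload);
        }
    }
}
```

Swapping in a different Encoder (for example one that parses the payload text as a number) changes only how the bytes are produced; the splitting logic stays the same.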

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718815#action_12718815 ] 

Grant Ingersoll commented on LUCENE-1676:
-----------------------------------------

OK, I moved the entry to contrib/CHANGES.  I'm going to commit this today.



[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718584#action_12718584 ] 

Mark Miller commented on LUCENE-1676:
-------------------------------------

I think we should decide one way or the other, though. Otherwise information gets lost and scattered arbitrarily. The position-sensitive hit highlighting patch (spanscorer) didn't make it into any CHANGES file. I don't feel it's a really big deal either, but I favor consistency over scattered and somewhat arbitrary.


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718817#action_12718817 ] 

Grant Ingersoll commented on LUCENE-1676:
-----------------------------------------

BTW, I'm curious if people have a better way to convert from char[] to byte[] for encoding the payloads (see FloatEncoder), other than going through Strings.
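
One possibility for textual payloads, sketched below with the JDK's java.nio.charset classes, is to wrap the relevant slice of the char[] in a CharBuffer and encode it directly, which avoids creating a String. The class and method names (PayloadBytesSketch, encodeUtf8, encodeFloat) are hypothetical and this is not the patch's FloatEncoder. The float variant still parses through a String and is shown only to illustrate the 4-byte big-endian layout assumed here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; a sketch, not the encoders from the patch.
class PayloadBytesSketch {

    /** Encodes a slice of a char[] as UTF-8 without building a String. */
    static byte[] encodeUtf8(char[] buf, int offset, int length) {
        ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(buf, offset, length));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    /** Parses the slice as a float (still via String) and stores it as 4 big-endian bytes. */
    static byte[] encodeFloat(char[] buf, int offset, int length) {
        float value = Float.parseFloat(new String(buf, offset, length));
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    public static void main(String[] args) {
        char[] token = "quick|JJ".toCharArray();
        byte[] payload = encodeUtf8(token, 6, 2); // the "JJ" part after the delimiter
        System.out.println(payload.length + " payload bytes");
    }
}
```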


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718626#action_12718626 ] 

Michael McCandless commented on LUCENE-1676:
--------------------------------------------

I agree we should decide.

I would lean towards always using contrib/CHANGES.  And then we should double-check all core CHANGES entries in 2.9 and move them to contrib if needed.


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718581#action_12718581 ] 

Michael McCandless commented on LUCENE-1676:
--------------------------------------------

Yeah we have not been consistent about it in the past... it's very much a chicken/egg thing, though.  If we consistently use contrib's CHANGES then presumably it'd get more visibility.  But I really don't feel strongly one way or another...


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718433#action_12718433 ] 

Michael McCandless commented on LUCENE-1676:
--------------------------------------------

Shouldn't the CHANGES entry in this patch go into contrib/CHANGES?


[jira] Updated: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1676:
------------------------------------

    Attachment: LUCENE-1676.patch

Here's a first draft of this.  See the test case for an example.


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718447#action_12718447 ] 

Grant Ingersoll commented on LUCENE-1676:
-----------------------------------------

bq. Shouldn't the CHANGES entry in this patch go into contrib/CHANGES?

It can; I've never quite been sure.  I think more people read the top-level CHANGES, so an entry there is more likely to be noticed, but I'm fine either way.


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718943#action_12718943 ] 

Grant Ingersoll commented on LUCENE-1676:
-----------------------------------------

I grabbed Apache Harmony's Integer.parseInt() code and converted it to take a char array, which should speed up the IntegerEncoder.  However, the Float.parseFloat() implementation relies on some constructs that are not available in JDK 1.4, so that one is going to have to stay as it is.

The main problem is its reliance on the HexStringParser (https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/HexStringParser.java), which needs some Long-specific attributes that are either newer than JDK 1.4 or Harmony-specific additions to Long (I didn't take the time to investigate).

At any rate, I added the Integer stuff to ArrayUtils and also added some tests.

For reference, see: 
https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/FloatingPointParser.java

https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/java/lang/Integer.java
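
For readers who want to see what parsing an int straight from a char array can look like, here is a minimal sketch. It is written from scratch for illustration and is neither the Harmony code nor the method added in the patch. The class name CharArrayParseSketch and the (chars, offset, len) signature are assumptions, and it handles only base 10 with an optional leading minus sign.

```java
// Hypothetical class; a from-scratch sketch, not Harmony's Integer.parseInt().
class CharArrayParseSketch {

    /** Parses a base-10 int from chars[offset .. offset+len) without creating a String. */
    static int parseInt(char[] chars, int offset, int len) {
        if (chars == null || len <= 0) {
            throw new NumberFormatException("empty input");
        }
        int i = offset;
        int end = offset + len;
        boolean negative = chars[i] == '-';
        if (negative && ++i == end) {
            throw new NumberFormatException("lone minus sign");
        }
        // Accumulate as a negative number so that Integer.MIN_VALUE is representable.
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int result = 0;
        for (; i < end; i++) {
            int digit = Character.digit(chars[i], 10);
            if (digit < 0) {
                throw new NumberFormatException("not a digit: " + chars[i]);
            }
            if (result < limit / 10) {
                throw new NumberFormatException("overflow");
            }
            result *= 10;
            if (result < limit + digit) {
                throw new NumberFormatException("overflow");
            }
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        char[] token = "weight|42".toCharArray();
        // Parse only the payload part after the delimiter at index 6.
        System.out.println(parseInt(token, 7, 2));
    }
}
```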




[jira] Resolved: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-1676.
-------------------------------------

       Resolution: Fixed
    Lucene Fields:   (was: [New])

Committed revision 784297.


[jira] Commented: (LUCENE-1676) New Token filter for adding payloads "in-stream"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718443#action_12718443 ] 

Mark Miller commented on LUCENE-1676:
-------------------------------------

That has been mildly inconsistent in the past. I have seen an occasion or two where contrib changes have made it into the core CHANGES. I think it's inconsistent, and we should keep those changes in their respective CHANGES.txt or make one for them, but it has happened.
