
Posted to java-user@lucene.apache.org by John Paul Sondag <js...@uiuc.edu> on 2007/07/03 19:49:48 UTC

Retrieve nearest token based off location in original Text

Hi,

I was wondering if it's possible to get the token offset based on the
position in the original text.

My problem is I'm working on my own "Snippet Generator" and I'm giving a
token index (call it t) as input and need to make a snippet of the original
text.  I want the Snippet to be some number of tokens (call it n tokens).
But to make the Snippet easier to read I want to see if it's close to the
end of a paragraph (if it is I'll make more of the Snippet before the token
than usual).  So I'm scanning the original text forward some number of
characters looking for a new line or tab.  If I find it I'd like to get the
token before that new line (and its offset, call it y).  Once I have the
offset I know I have y - t tokens after my token, and finally I know to put
n-(y-t) tokens before my token and can successfully make my Snippet.
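
The arithmetic above can be sketched as follows. This is only an illustration: the class name SnippetWindow, the compute method, the use of -1 for "no break found", and the even split used as the "usual" layout are all assumptions, not anything from Lucene or from the post.

```java
/** Hypothetical helper: splits a snippet of n tokens around token index t. */
final class SnippetWindow {
    final int before; // number of tokens taken before token t
    final int after;  // number of tokens taken after token t

    private SnippetWindow(int before, int after) {
        this.before = before;
        this.after = after;
    }

    /**
     * @param t index of the token the snippet is built around
     * @param n number of surrounding tokens in the snippet (token t itself
     *          is not counted, following the arithmetic in the post)
     * @param y index of the last token before the paragraph break,
     *          or -1 if no break was found in the scanned range
     */
    static SnippetWindow compute(int t, int n, int y) {
        int usualAfter = n / 2; // assumed "usual" layout: an even split
        // Near the end of a paragraph only y - t tokens are available after t.
        int after = (y >= t) ? Math.min(y - t, usualAfter) : usualAfter;
        // Whatever cannot go after the token goes before it: n - (y - t).
        return new SnippetWindow(n - after, after);
    }
}
```

With t = 50, n = 10 and a paragraph ending at token y = 52, this gives 2 tokens after and 8 tokens before the target token.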

Thanks in advance!

--JP

Re: Retrieve nearest token based off location in original Text

Posted by Chris Hostetter <ho...@fucit.org>.

: "This" has index 1, "is" has index 2, "an" has index 3, "Example" has index 4.
: What I have is the actual "character position" in the original text.  "This"

in that case, you'll have to do a while loop over next() calls and check
the startOffset (or endOffset) of each until you find the one you are
looking for.
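
A minimal sketch of that loop, written without a Lucene dependency so it can be run on its own. The class TokenLocator and the starts array are hypothetical stand-ins: with a real TokenStream, each entry would be the startOffset() of the Token returned by the corresponding next() call.

```java
/** Hypothetical helper: maps a character offset to a 1-based token index. */
final class TokenLocator {
    /**
     * @param starts     start offset of each token, in stream order
     * @param charOffset character position in the original text
     * @return 1-based index of the last token that starts at or before
     *         charOffset, or 0 if every token starts after it
     */
    static int tokenIndexAt(int[] starts, int charOffset) {
        int index = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] > charOffset) {
                break; // this token begins past the position: stop scanning
            }
            index = i + 1; // remember the most recent token seen so far
        }
        return index;
    }
}
```

For the text "This is an Example" the start offsets are 0, 5, 8 and 11, so character offset 12 maps to token 4. The scan is linear in the number of tokens; if many lookups are needed against the same text, collecting the start offsets once and reusing them would avoid re-tokenizing.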



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Retrieve nearest token based off location in original Text

Posted by John Paul Sondag <js...@uiuc.edu>.

I thought that went to the "index" of the token.  I may not understand it
completely, but this is how I currently view the TokenStream.

For example if my text was the following:

This is an Example

"This" has index 1, "is" has index 2, "an" has index 3, and "Example" has
index 4.  What I have is the actual "character position" in the original
text.  "This" is characters 0-3, "is" is characters 5-6, "an" is characters
8-9, and "Example" is characters 11-17.  I know that given Token 4 (Example)
I can get the startOffset and endOffset (11 and 17).  What I'm wondering is,
given a character offset, can I get a tokenIndex?  (I.e., given character
offset 12, it would return 4, because Example is the token closest to
character 12.)
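
The index and character ranges in this example can be reproduced with a small self-contained sketch. It uses plain whitespace splitting rather than StandardTokenizer, and the class OffsetDemo and its "index:start-end" output format are made up for illustration. The ranges are inclusive of the last character, as in the message; Lucene's endOffset() is, as far as I know, one past the last character, so it would be worth checking whether it reports 18 rather than 17 for "Example".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: lists whitespace-separated tokens with their offsets. */
final class OffsetDemo {
    /** Returns one "index:start-end" entry per token (1-based index, inclusive end). */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int index = 0;
        while (i < text.length()) {
            // Skip the whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            index++;
            out.add(index + ":" + start + "-" + (i - 1));
        }
        return out;
    }
}
```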

--JP

On 7/6/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : I never got a response to this and thought maybe I was too wordy.
> :
> : I'm wondering if there's a way where given a position in the original
> text
> : you can retrieve the token index that is nearest to that position using
> the
> : StandardToken/StandardTokenizer classes?
>
> i may not be understanding the question, but wouldn't that just be...
>
>   // Advance the stream thePositionYouKnow times; t ends up as that token.
>   TokenStream s = getTokenStreamForOriginalText();
>   Token t = null;
>   for (int i = 0; i < thePositionYouKnow; i++) {
>     t = s.next();
>   }
>   return t;
>
> ?
>
>
> -Hoss
>

Re: Retrieve nearest token based off location in original Text

Posted by Chris Hostetter <ho...@fucit.org>.

: I never got a response to this and thought maybe I was too wordy.
:
: I'm wondering if there's a way where given a position in the original text
: you can retrieve the token index that is nearest to that position using the
: StandardToken/StandardTokenizer classes?

i may not be understanding the question, but wouldn't that just be...

  // Advance the stream thePositionYouKnow times; t ends up as that token.
  TokenStream s = getTokenStreamForOriginalText();
  Token t = null;
  for (int i = 0; i < thePositionYouKnow; i++) {
    t = s.next();
  }
  return t;

?


-Hoss



Re: Retrieve nearest token based off location in original Text

Posted by John Paul Sondag <js...@uiuc.edu>.

Hi,

I never got a response to this and thought maybe I was too wordy.

I'm wondering if there's a way where given a position in the original text
you can retrieve the token index that is nearest to that position using the
StandardToken/StandardTokenizer classes?



--JP

On 7/3/07, John Paul Sondag <js...@uiuc.edu> wrote:
>
> Hi,
>
> I was wondering if it's possible to get the token offset based on the
> position in the original text.
>
> My problem is I'm working on my own "Snippet Generator" and I'm giving a
> token index (call it t) as input and need to make a snippet of the original
> text.  I want the Snippet to be some number of tokens (call it n tokens).
> But to make the Snippet easier to read I want to see if it's close to the
> end of a paragraph (if it is I'll make more of the Snippet before the token
> than usual).  So I'm scanning the original text forward some number of
> characters looking for a new line or tab.  If I find it I'd like to get the
> token before that new line (and its offset, call it y).  Once I have the
> offset I know I have y - t tokens after my token, and finally I know to put
> n-(y-t) tokens before my token and can successfully make my Snippet.
>
> Thanks in advance!
>
> --JP
>