Posted to solr-user@lucene.apache.org by Leonardo Oliveira <ll...@gmail.com> on 2014/05/12 01:10:56 UTC
Payload use case

Hi everybody. This is my first post.

I wrote a PDF text extractor that returns text in the following format:

"The|(1,2)(3,4) quick|(5,6)(7,8) brown|(9,10)(11,12) ..."

where

each (x,y) is a two-dimensional coordinate on the page that locates
the term, i.e.:

"The"
(1,2) is the upper left coordinate of the letter 'T'
(3,4) is the lower right coordinate of the letter 'e'

"quick"
(5,6) is the upper left coordinate of the letter 'q'
(7,8) is the lower right coordinate of the letter 'k'

and so on ...
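
As an illustration, the format above could be parsed back into terms and
bounding boxes roughly like this. It is only a sketch in plain Java: the
class names PositionedTerm and PositionedTextParser are hypothetical, and
integer coordinates are assumed, as in the example line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder: one term plus its upper-left and lower-right corners. */
final class PositionedTerm {
    final String text;
    final int left, top, right, bottom;

    PositionedTerm(String text, int left, int top, int right, int bottom) {
        this.text = text;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }
}

/** Hypothetical parser for tokens of the form term|(x1,y1)(x2,y2). */
final class PositionedTextParser {
    // Group 1: the term; groups 2-5: x1, y1, x2, y2 (integers assumed).
    private static final Pattern TOKEN = Pattern.compile(
            "(\\S+?)\\|\\((-?\\d+),(-?\\d+)\\)\\((-?\\d+),(-?\\d+)\\)");

    static List<PositionedTerm> parse(String input) {
        List<PositionedTerm> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // Each successful find() consumes one whole term|(..)(..) token.
        while (m.find()) {
            terms.add(new PositionedTerm(
                    m.group(1),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)),
                    Integer.parseInt(m.group(4)),
                    Integer.parseInt(m.group(5))));
        }
        return terms;
    }
}
```

Input that does not match the pattern is silently skipped in this sketch;
a real extractor would probably want to report it instead.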

For indexing, I plan to store the coordinates as payloads on each
word/term of the sentence. I already know how to store them through a
custom DelimitedPayloadTokenFilter, but I don't know the best way to read
those payloads at query time. I need to read the payloads of the terms
that match the user's query so that, with this information, I can
highlight the matched words on the user's screen.
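
Since a payload is just a byte array attached to a term position, the two
corner points have to be packed into bytes in some agreed layout. One
possible sketch, in plain Java, is shown here. The class BoxPayloadCodec
and the fixed 16-byte big-endian layout are assumptions for illustration
and not part of Lucene; the wiring into a token filter is left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec: packs a bounding box into a fixed-size payload. */
final class BoxPayloadCodec {
    // Four 32-bit ints: left, top, right, bottom (assumed layout).
    static final int BYTES = 4 * Integer.BYTES;

    /** Encodes the box as 16 bytes, big-endian (ByteBuffer's default order). */
    static byte[] encode(int left, int top, int right, int bottom) {
        return ByteBuffer.allocate(BYTES)
                .putInt(left)
                .putInt(top)
                .putInt(right)
                .putInt(bottom)
                .array();
    }

    /**
     * Decodes a box from a slice of a larger array. Taking offset and length
     * lets the caller pass a shared buffer without copying it first.
     */
    static int[] decode(byte[] payload, int offset, int length) {
        if (length != BYTES) {
            throw new IllegalArgumentException(
                    "expected " + BYTES + " payload bytes, got " + length);
        }
        ByteBuffer buf = ByteBuffer.wrap(payload, offset, length);
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt() };
    }
}
```

A fixed-width layout keeps decoding trivial at the cost of 16 bytes per
term occurrence; a variable-length integer encoding would be smaller but
needs more code.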

I don't want to highlight the text itself, as the default Highlighter or
FastVectorHighlighter does, but rather the page image (thumbnail), i.e., I
want a two-dimensional, payload-based highlighter. This way I would not
need to store the original text, which decreases the index size, and the
user experience improves with a "visually highlighted text fragment".
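
To make the thumbnail idea concrete, a decoded box would have to be scaled
from page coordinates to thumbnail pixels before drawing. A rough sketch
follows; the class ThumbnailScaler, the parameter names, and the assumption
that page and thumbnail share the same origin and axis directions are all
hypothetical.

```java
/** Hypothetical helper: maps a page-space box onto a thumbnail image. */
final class ThumbnailScaler {
    /**
     * box is {left, top, right, bottom} in page units. The upper-left corner
     * is rounded down and the lower-right corner rounded up, so the scaled
     * rectangle never ends up smaller than the term it covers.
     */
    static int[] scale(int[] box, double pageWidth, double pageHeight,
                       int thumbWidth, int thumbHeight) {
        return new int[] {
            (int) Math.floor(box[0] * thumbWidth / pageWidth),
            (int) Math.floor(box[1] * thumbHeight / pageHeight),
            (int) Math.ceil(box[2] * thumbWidth / pageWidth),
            (int) Math.ceil(box[3] * thumbHeight / pageHeight)
        };
    }
}
```

For example, on a 1000 x 2000 page rendered as a 100 x 200 thumbnail, the
box (100,200)(300,400) maps to (10,20)(30,40).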

My question is: am I using payloads properly for my use case? Or should I
use another strategy to store those coordinates so that I can read them at
query time?

Would I have performance issues if I need to read a lot of payloads that
match the user's query?

Are payloads part of the Lucene cache?

Should payloads be used only for relevance purposes, with a custom
implementation of the Similarity class?

How can I use coordinates as "term offsets"? In this case, my "offset" is
relative to the page's global Cartesian axes, not a character offset into
the source text.

Thank you for listening.

Regards