Posted to java-user@lucene.apache.org by Sergey Repnikov <re...@megaputer.ru> on 2016/10/31 11:55:17 UTC

Implementing own Analyser components.

Hello. My name is Sergey, and I'm working on extending Lucene's
functionality.

As I read in the Javadoc for the "org.apache.lucene.analysis" package, it
is preferable to ask on this mailing list before extending Lucene, because
some features may already exist.

I would like to be able to search by part of speech and within a
sentence. Is there any way to get this functionality out of the box? If
so, how?

If not, do I understand correctly that custom attributes are not saved to
the index when a TokenStream is written to a Directory? And is the only
way to save metadata associated with a term to use a payload, and then
ask for it at search time?

From what I found on Google, a payload is not stored alongside the term
itself; it is associated with the term by position. I haven't yet
understood how the index stores tokens and their associated metadata, so
maybe that detail is crucial in some cases, or maybe not. Maybe there is
a way to extend the index/IndexWriter to save and then retrieve custom
attributes.

So can you tell me, based on your experience, what the best way is to do
what I want?

Thank you.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Implementing own Analyser components.

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Sergey,


Here is the table of tags from http://www.nltk.org/book/ch05.html

Tag   Meaning              English Examples
ADJ   adjective            new, good, high, special, big, local
ADP   adposition           on, of, at, with, by, into, under
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner, article  the, a, some, most, every, no, which
NOUN  noun                 year, home, costs, time, Africa
NUM   numeral              twenty-four, fourth, 1991, 14:24
PRT   particle             at, on, out, over per, that, up, with
PRON  pronoun              he, their, her, its, my, I, us
VERB  verb                 is, say, told, given, playing, would
.     punctuation marks    . , ; !
X     other                ersatz, esprit, dunno, gr8, univeristy




So DET and CONJ words are stop words in most of the use cases Lucene is
meant to handle. For example, there is no point in searching for PRON:he,
since on a fiction-book site it would match practically every document.


However, if you still need to index tokens such as “Brand:Microsoft”,
“Sentiment:Positive”, “DET:123” and so on, you can do that in Lucene by
defining fields: Brand, Sentiment, DET, PRON, VERB, and so on.
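
As a rough illustration of the one-field-per-tag idea, here is a minimal
sketch in plain Java with no Lucene dependency. It assumes the text has
already been tagged by some external tagger; TaggedToken, groupByTag and
PosFieldsSketch are hypothetical names made up for this example, not
Lucene classes. Each entry of the resulting map would then be added to a
Lucene Document as a field named after the tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PosFieldsSketch {

    /** A token paired with its part-of-speech tag (hypothetical helper type). */
    static final class TaggedToken {
        final String text;
        final String tag;

        TaggedToken(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    /**
     * Groups token texts by tag, keeping first-seen tag order.
     * Each key is meant to become one field name (NOUN, VERB, ...),
     * and its values the terms indexed under that field.
     */
    static Map<String, List<String>> groupByTag(List<TaggedToken> tokens) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (TaggedToken token : tokens) {
            fields.computeIfAbsent(token.tag, k -> new ArrayList<>()).add(token.text);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed tagger output for the sentence "she told good news".
        List<TaggedToken> tokens = List.of(
                new TaggedToken("she", "PRON"),
                new TaggedToken("told", "VERB"),
                new TaggedToken("good", "ADJ"),
                new TaggedToken("news", "NOUN"));

        // Print one "field: terms" line per tag.
        groupByTag(tokens).forEach((field, values) ->
                System.out.println(field + ": " + String.join(" ", values)));
    }
}
```

With fields built this way, a query such as NOUN:news would only match
documents in which "news" was tagged as a noun. Note that this sketch
does not keep token positions aligned across fields, so it does not by
itself cover the "within a sentence" part of the question.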



I hope I helped a little :) thanks,


Fuad Efendi
Search Relevancy Tuning
http://www.tokenizer.ca






